Abstract: This paper presents an adaptive version of the Hill estimator
based on Lepski’s model selection method. This simple data-driven index
selection method is shown to satisfy an oracle inequality and is checked to
achieve the lower bound recently derived by Carpentier and Kim. In order
to establish the oracle inequality, we derive non-asymptotic variance bounds
and concentration inequalities for Hill estimators. These concentration
inequalities are derived from Talagrand’s concentration inequality for smooth
functions of independent exponentially distributed random variables
combined with three tools of Extreme Value Theory: the quantile transform,
Karamata’s representation of slowly varying functions, and Rényi’s
characterisation for the order statistics of exponential samples. The performance
of this computationally and conceptually simple method is illustrated using
Monte-Carlo simulations.

1. Introduction

The basic questions faced by Extreme Value Analysis consist in estimating the
probability of exceeding a threshold that is larger than the sample maximum
and estimating a quantile of an order that is larger than 1 minus the
reciprocal of the sample size. In words, they consist in making inferences on regions
that lie outside the support of the empirical distribution. In order to face these
challenges in a sensible framework, Extreme Value Theory (EVT) assumes that
the sampling distribution $F$ satisfies a regularity condition. Indeed, in
heavy-tail analysis, the tail function $\overline{F} = 1 - F$ is supposed to be regularly
varying, that is, $\lim_{\tau \to \infty} \overline{F}(\tau x)/\overline{F}(\tau)$ exists for all $x > 0$.
This amounts to assuming the existence of some $\gamma > 0$ such that the limit is
$x^{-1/\gamma}$ for all $x$. In other words, if we define the excess distribution above
the threshold $\tau$ by its survival function
$x \mapsto \overline{F}_\tau(x) = \overline{F}(x)/\overline{F}(\tau)$ for $x > \tau$, then $\overline{F}$ is
regularly varying if and only if $F_\tau$ converges weakly towards a Pareto distribution.
The sampling distribution $F$ is then said to belong to the max-domain of attraction
of a Fréchet distribution with index $\gamma > 0$ (abbreviated in $F \in \mathrm{MDA}(\gamma)$)
and $\gamma$ is called the extreme value index.

* Research was partially supported by the ANR14-GE20-0006-01 project AMERISKA network.

The main impediment to the large exceedance and large quantile estimation
problems alluded to above turns out to be the estimation of the extreme value index.
Since the inception of Extreme Value Analysis, many estimators have been
defined, analysed and implemented into software. Hill [1975] introduced a simple,
yet remarkable, collection of estimators: for $k < n$,

$$\hat\gamma(k) = \frac{1}{k} \sum_{i=1}^{k} \ln \frac{X_{(i)}}{X_{(k+1)}} = \frac{1}{k} \sum_{i=1}^{k} i \ln \frac{X_{(i)}}{X_{(i+1)}},$$

where $X_{(1)} \ge \ldots \ge X_{(n)}$ are the order statistics of the sample $X_1, \ldots, X_n$ (the
non-increasing rearrangement of the sample).
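The definition of the Hill estimators can be made concrete with a few lines of code. The sketch below is only an illustration under stated assumptions: the function name `hill_estimates` and the toy sample are hypothetical, and the estimate with index k is taken to be the average of ln(X_(i)/X_(k+1)) over the k largest observations.

```python
import math

def hill_estimates(sample):
    """Return [gamma_hat(1), ..., gamma_hat(n-1)] for strictly positive observations."""
    x = sorted(sample, reverse=True)              # X_(1) >= ... >= X_(n)
    if len(x) < 2 or x[-1] <= 0:
        raise ValueError("need at least two strictly positive observations")
    logs = [math.log(v) for v in x]
    estimates = []
    running = 0.0                                 # sum of ln X_(i) for i <= k
    for k in range(1, len(x)):
        running += logs[k - 1]
        estimates.append(running / k - logs[k])   # mean of ln(X_(i) / X_(k+1))
    return estimates

# Toy sample (hypothetical): sorted values 8, 4, 2, 1 have log-spacings equal to ln 2.
est = hill_estimates([1.0, 8.0, 2.0, 4.0])
print(est)
```

On this toy sample the three estimates are ln 2, 1.5 ln 2 and 2 ln 2, which can be checked by hand from the definition.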

An integer sequence $(k_n)$ is said to be intermediate if $\lim_{n\to\infty} k_n = \infty$ while
$\lim_{n\to\infty} k_n/n = 0$. It is well known that $F$ belongs to $\mathrm{MDA}(\gamma)$ for some $\gamma > 0$ if
and only if, for all intermediate sequences $(k_n)$, $\hat\gamma(k_n)$ converges in probability
towards $\gamma$ [de Haan and Ferreira, 2006, Mason, 1982]. Under mildly stronger
conditions, it can be shown that $\sqrt{k_n}(\hat\gamma(k_n) - \mathbb{E}\hat\gamma(k_n))$ is asymptotically Gaussian
with variance $\gamma^2$. This suggests that, in order to minimise the quadratic risk
$\mathbb{E}[(\hat\gamma(k_n) - \gamma)^2]$ or the absolute risk $\mathbb{E}|\hat\gamma(k_n) - \gamma|$, an appropriate choice for
$k_n$ has to be made. If $k_n$ is too large, the Hill estimator $\hat\gamma(k_n)$ suffers a large bias
and, if $k_n$ is too small, $\hat\gamma(k_n)$ suffers erratic fluctuations.

As all estimators of the extreme value index face this dilemma [see Beirlant
et al., 2004, de Haan and Ferreira, 2006, Resnick, 2007, and references therein],
during the last three decades, a variety of data-driven selection methods for
$k_n$ has been proposed in the literature (see Hall and Weissman [1997], Hall
and Welsh [1985], Danielsson et al. [2001], Draisma et al. [1999], Drees and
Kaufmann [1998], Drees et al. [2000], Grama and Spokoiny [2008], Carpentier
and Kim [2015] to name a few). A related but distinct problem is considered
by Carpentier and Kim [2014]: constructing uniform and adaptive confidence
intervals for the extreme value index.

The rationale for investigating adaptive Hill estimation stems from
computational simplicity and variance optimality of properly chosen Hill estimators
[Beirlant et al., 2006].

The hallmark of our approach is to combine techniques of EVT with tools
from concentration of measure theory. As, to our knowledge, the impact of
the concentration of measure phenomenon in EVT has received little attention,
we comment on and motivate the use of concentration arguments. Talagrand’s
concentration phenomenon for products of exponential distributions is one instance
of a general phenomenon: concentration of measure in product spaces [Ledoux,
2001, Ledoux and Talagrand, 1991]. The phenomenon may be summarised in
a simple quote: functions of independent random variables that do not depend
too much on any of them are almost constant [Talagrand, 1996a].

The concentration approach helps to split the investigation into two steps: the
first step consists in bounding the fluctuations of the random variable under
concern around its median or its expectation, while the second step focuses on the
expectation. This approach has seriously simplified the investigation of suprema
of empirical processes and made the life of many statisticians easier [Koltchinskii,
2008, Massart, 2007, Talagrand, 1996b, 2005]. To point out the potential uses
of concentration inequalities in the field of EVT is one purpose of this paper. In
statistics, concentration inequalities have proved very useful when dealing with
estimator selection and adaptivity issues: sharp, non-asymptotic tail bounds can
be combined with simple union bounds in order to obtain uniform guarantees
on the risk of a collection of estimators. Using concentration inequalities to
investigate adaptive choice of the number of order statistics to be used in tail index
estimation is a natural thing to do.

In the present setting, tail index estimators are functions of independent
random variables. Talagrand’s quote raises a first question: in which way are
these tail functionals smooth functions of independent random variables? We do
not attempt here to revisit the asymptotic approach described by [Drees, 1998b]
which equates smoothness with Hadamard differentiability. Our approach is
non-asymptotic and our conception of smoothness somewhat circular: smooth
functionals are those functionals for which we can obtain good concentration
inequalities.

In this paper, we combine Talagrand’s concentration inequality for smooth
functions of independent exponentially distributed random variables (Theorem
2.15) with three traditional tools of EVT: the quantile transform, Karamata’s
representation for slowly varying functions, and Rényi’s characterisation of the
joint distribution of order statistics of exponential samples. This allows us to
establish concentration inequalities for the Hill process $(\sqrt{k}(\hat\gamma(k) - \mathbb{E}\hat\gamma(k)))_k$
(Theorem 3.3) in Section 3.1.

In Section 3.2, we build on these concentration inequalities to analyse the
performance of a variant of Lepski’s rule defined in Sections 2.3 and 3.2: Theorem
3.8 describes an oracle inequality and Corollary 3.12 assesses the performance
of this simple selection rule under a mild assumption on the so-called von Mises
function. Note that the condition is less demanding than the regular variation
condition on the von Mises function that has often been assumed when looking
for adaptive tail index estimators (notable exceptions being [Carpentier and
Kim, 2015] and [Grama and Spokoiny, 2008]). It reveals that the performance
of Hill estimators selected by Lepski’s method matches known lower bounds (see
Section 2.4), that is, they suffer the loss of efficiency which is inherent to this
problem, but not more.

Proofs are given in Section 4. Finally, in Section 5, we examine the
performance of this adaptive Hill estimator for finite sample sizes using Monte-Carlo
simulations.

2. Background, notations and tools

2.1. The Hill estimator as a smooth tail statistic

The quantile function $F^{\leftarrow}$ is the generalised inverse of the distribution function
$F$. The tail quantile function of $F$ is a non-decreasing function defined on $(1, \infty)$
by $U = (1/(1-F))^{\leftarrow}$, or by

$$U(t) = \inf\{x : F(x) \ge 1 - 1/t\} = F^{\leftarrow}(1 - 1/t).$$

In this text, we use a variation of the quantile transform that fits EVT: if
$E$ is exponentially distributed, then $U(\exp(E))$ is distributed according to $F$.
Moreover, by the same argument, the order statistics $X_{(1)} \ge \ldots \ge X_{(n)}$ are
distributed as a monotone transformation of the order statistics $Y_{(1)} \ge \ldots \ge Y_{(n)}$
of a sample of $n$ independent standard exponential random variables,

$$(X_{(1)}, \ldots, X_{(n)}) \overset{d}{=} \left(U(e^{Y_{(1)}}), \ldots, U(e^{Y_{(n)}})\right).$$

Thanks to Rényi’s representation for order statistics of exponential samples,
agreeing on $Y_{(n+1)} = 0$, the rescaled exponential spacings $Y_{(1)} - Y_{(2)}, \ldots, i(Y_{(i)} - Y_{(i+1)}), \ldots, (n-1)(Y_{(n-1)} - Y_{(n)}), nY_{(n)}$
are independent and exponentially distributed.
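The two representations above translate directly into a simulation recipe: draw independent standard exponential variables, turn them into exponential order statistics through Rényi’s representation, then push them through the tail quantile function. The sketch below assumes, purely for illustration, a standard Pareto law with $U(t) = t^{\gamma}$; the function names are hypothetical.

```python
import math
import random

def exponential_order_statistics(n, rng):
    """Y_(1) >= ... >= Y_(n) built from Renyi's representation:
    Y_(i) = sum_{j=i}^{n} E_j / j with E_1, ..., E_n i.i.d. standard exponential."""
    e = [rng.expovariate(1.0) for _ in range(n)]
    y = [0.0] * n
    tail = 0.0
    for i in range(n, 0, -1):                 # i = n, n-1, ..., 1
        tail += e[i - 1] / i
        y[i - 1] = tail
    return y

def pareto_tail_quantile(t, gamma):
    """U(t) = t**gamma: tail quantile function of a standard Pareto law (assumed example)."""
    return t ** gamma

rng = random.Random(0)
gamma = 0.5
y = exponential_order_statistics(1000, rng)
x = [pareto_tail_quantile(math.exp(v), gamma) for v in y]   # X_(i) = U(exp(Y_(i)))
```

By construction the list `x` is already sorted in non-increasing order, so no sorting step is needed before computing Hill estimators from it.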

The quantile transform and Rényi’s representation are complemented by
Karamata’s representation for slowly varying functions. Recall that a function
$L$ is slowly varying at infinity if for all $x > 0$, $\lim_{t\to\infty} L(tx)/L(t) = x^0 = 1$.

The von Mises condition specifies the form of Karamata’s representation [see
Resnick, 2007, Corollary 2.1] of the slowly varying component $t^{-\gamma}U(t)$ of $U(t)$.

Definition 2.1 (von Mises condition). A distribution function $F$ belonging
to $\mathrm{MDA}(\gamma)$, $\gamma > 0$, satisfies the von Mises condition if there exist a constant
$t_0 \ge 1$, a constant $c = U(t_0)t_0^{-\gamma}$ and a measurable function $\eta$ on $(1,\infty)$ such
that, for $t \ge t_0$,

$$U(t) = c\, t^{\gamma} \exp\left( \int_{t_0}^{t} \frac{\eta(s)}{s}\, \mathrm{d}s \right)$$

with $\lim_{s\to\infty} \eta(s) = 0$. The function $\eta$ is called the von Mises function.

In the sequel, we assume that the sampling distribution $F \in \mathrm{MDA}(\gamma)$, $\gamma > 0$,
satisfies the von Mises condition with $t_0 = 1$ and von Mises function $\eta$, and define
the non-increasing function $\overline{\eta}$ from $[1,\infty)$ to $[0,\infty)$ by $\overline{\eta}(t) = \sup_{s \ge t} |\eta(s)|$. In
the text, we assume that $\overline{\eta}(1) < \infty$.

Combining the quantile transformation, Rényi’s and Karamata’s
representations, it is straightforward that, under the von Mises condition, the sequence
of Hill estimators is distributed as a function of the largest order statistics of a
standard exponential sample.

Proposition 2.2. The vector of Hill estimators $(\hat\gamma(k))_{k<n}$ is distributed as the
random vector

$$\left( \frac{1}{k} \sum_{i=1}^{k} \int_0^{E_i} \left( \gamma + \eta\left(e^{\frac{u}{i} + Y_{(i+1)}}\right) \right) \mathrm{d}u \right)_{k<n} \tag{2.3}$$

where $E_1, \ldots, E_n$ are independent standard exponential random variables while,
for $i \le n$, $Y_{(i)} = \sum_{j=i}^{n} E_j/j$ is distributed like the $i$th order statistic of an
$n$-sample of the exponential distribution.

For a fixed $k < n$, a second distributional representation is available,

$$\hat\gamma(k) \overset{d}{=} \frac{1}{k} \sum_{i=1}^{k} \int_0^{E_i} \left( \gamma + \eta\left(e^{u + Y_{(k+1)}}\right) \right) \mathrm{d}u \tag{2.4}$$

where $E_1, \ldots, E_k$ and $Y_{(k+1)}$ are defined as in Proposition 2.2.

This second, simpler, distributional representation stresses the fact that,
conditionally on $Y_{(k+1)}$, $\hat\gamma(k)$ is distributed as a mixture of sums of independent
random variables approximately distributed as exponential random variables with
scale $\gamma$. This distributional identity suggests that the variance of $\hat\gamma(k)$ scales like
$\gamma^2/k$, an intuition that is corroborated by analysis, see Section 3.1.

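The bias of $\hat\gamma(k)$ is connected with the von Mises function $\eta$ by the next
formula

$$\mathbb{E}\hat\gamma(k) - \gamma = \mathbb{E}\left[ \int_0^{\infty} e^{-u}\, \eta\left(e^{Y_{(k+1)}} e^{u}\right) \mathrm{d}u \right] = \mathbb{E}\left[ \int_1^{\infty} \frac{\eta\left(e^{Y_{(k+1)}} v\right)}{v^2}\, \mathrm{d}v \right].$$

Henceforth, let $b$ be defined on $(1, \infty)$ by

$$b(t) = \int_1^{\infty} \frac{\eta(tv)}{v^2}\, \mathrm{d}v = t \int_t^{\infty} \frac{\eta(v)}{v^2}\, \mathrm{d}v. \tag{2.5}$$

The quantity $b(t)$ is the bias of the Hill estimator $\hat\gamma(k)$ given $\overline{F}(X_{(k+1)}) = 1/t$.
The second expression for $b$ shows that $b$ is differentiable with respect to $t$ (even
though $\eta$ might be nowhere differentiable) and that

$$b'(t) = \frac{b(t) - \eta(t)}{t}.$$

The von Mises function governs both the rate of convergence of $U(tx)/U(t)$
towards $x^{\gamma}$, or equivalently of $\overline{F}(tx)/\overline{F}(t)$ towards $x^{-1/\gamma}$, and the rate of
convergence of $|\mathbb{E}\hat\gamma(k) - \gamma|$ towards 0.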
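As a worked illustration, consider the special case of a pure power von Mises function $\eta(t) = C t^{\rho}$ with $\rho < 0$. This choice is an assumption made here for the sake of the example, not a hypothesis of the text. With $b(t) = \int_1^{\infty} \eta(tv) v^{-2}\, \mathrm{d}v$, the bias function and its derivative can be computed in closed form:

```latex
b(t) = \int_1^{\infty} \frac{C (tv)^{\rho}}{v^{2}}\,\mathrm{d}v
     = C t^{\rho} \int_1^{\infty} v^{\rho-2}\,\mathrm{d}v
     = \frac{C t^{\rho}}{1-\rho}
     = \frac{\eta(t)}{1-\rho},
\qquad
b'(t) = \frac{\rho\, C t^{\rho-1}}{1-\rho}
      = \frac{b(t)-\eta(t)}{t}.
```

In this example the bias is proportional to the von Mises function itself, which is why a polynomial bound on $\eta$ translates into the same polynomial rate for the bias.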


2.2. Frameworks 

The difficulty in extreme value index estimation stems from the fact that, for any
collection of estimators, for any intermediate sequence $(k_n)$, and for any $\gamma > 0$,
there is a distribution function $F \in \mathrm{MDA}(\gamma)$ such that the bias $|\mathbb{E}\hat\gamma(k_n) - \gamma|$
decays at an arbitrarily slow rate. This has led authors to put conditions on the
rate of convergence of $U(tx)/U(t)$ towards $x^{\gamma}$ as $t$ tends to infinity while $x > 0$,
or equivalently, on the rate of convergence of $\overline{F}(tx)/\overline{F}(t)$ towards $x^{-1/\gamma}$. These
conditions have then to be translated into conditions on the rate of decay of
the bias of estimators. As we focus on Hill estimators, the connection between
the rate of convergence of $U(tx)/U(t)$ towards $x^{\gamma}$ and the rate of decay of the
bias is transparent and well-understood [Segers, 2002]: the theory of O-regular
variation provides an adequate setting for describing both rates of convergence
[Bingham et al., 1987]. In words, if a positive function $g$ defined over $[1,\infty)$ is
such that, for some $\alpha \in \mathbb{R}$, for all $\Lambda > 1$, $\limsup_t \sup_{x \in [1,\Lambda]} g(tx)/g(t) < \infty$, $g$
is said to have bounded increase. If $g$ has bounded increase, the class $O\Pi_g$ is
the class of measurable functions $f$ on some interval $[a, \infty)$, $a > 0$, such that as
$t \to \infty$, $f(tx) - f(t) = O(g(t))$ for all $x \ge 1$.

For example, the analysis carried out by Carpentier and Kim [2015] rests on
the condition that, if $F \in \mathrm{MDA}(\gamma)$, for some $C > 0$, $D \ge 0$ and $\rho < 0$,

$$\left| \frac{\overline{F}(x)}{C x^{-1/\gamma}} - 1 \right| \le D x^{\rho/\gamma}. \tag{2.6}$$

This condition implies that $\ln(t^{-\gamma}U(t)) \in O\Pi_g$ with $g(t) = t^{\rho}$ [Segers, 2002,
p. 473]. Thus, under the von Mises condition, Condition (2.6) implies that the
function $\int_t^{\infty} (\eta(s)/s)\, \mathrm{d}s$ belongs to $O\Pi_g$ with $g(t) = t^{\rho}$. Moreover, the Abelian
and Tauberian theorems from [Segers, 2002] assert that $\int_t^{\infty} (\eta(s)/s)\, \mathrm{d}s \in O\Pi_g$ if
and only if $|\mathbb{E}\hat\gamma(k_n) - \gamma| = O(g(n/k_n))$ for any intermediate sequence $(k_n)$.

In this text, we are ready to assume that if $F \in \mathrm{MDA}(\gamma)$ and satisfies the
von Mises condition, then, for some $C > 0$ and $\rho < 0$ and $t \ge 1$,

$$\overline{\eta}(t) \le C t^{\rho}.$$

This condition is arguably more stringent than (2.6). However, we do not want
to assume that $\eta$ satisfies a regular variation property. This would imply that
$t \mapsto |b(t)|$ is $\rho$-regularly varying.

Indeed, assuming as in [Hall and Welsh, 1985] and several subsequent papers
that $\overline{F}$ satisfies

$$\overline{F}(x) = C x^{-1/\gamma} \left( 1 + D x^{\rho/\gamma} + o(x^{\rho/\gamma}) \right) \tag{2.7}$$

where $C > 0$, $D \ne 0$ are constants and $\rho < 0$, or equivalently, [Csörgő, Deheuvels,
and Mason, 1985, Drees and Kaufmann, 1998] that $U$ satisfies

$$U(t) = C^{\gamma} t^{\gamma} \left( 1 + \gamma D C^{\rho} t^{\rho} + o(t^{\rho}) \right)$$

(which entails that $\eta$ is regularly varying) makes the problem of extreme value
index estimation easier (but not easy). These conditions entail that, for any
intermediate sequence $(k_n)$, the ratio $|\mathbb{E}[\hat\gamma(k_n) - \gamma]|/(n/k_n)^{\rho}$ converges towards
a finite limit as $n$ tends to $\infty$ [Beirlant et al., 2004, de Haan and Ferreira, 2006,
Segers, 2002]. This makes the estimation of the second-order parameter a very
natural intermediate objective [see for example Drees and Kaufmann, 1998].


2.3. Lepski’s method and adaptive tail index estimation 

The necessity of developing data-driven index selection methods is illustrated
in Figure 1, which displays the estimated standardised root mean squared error
(RMSE) of Hill estimators

$$\mathbb{E}\left[ \left( \frac{\hat\gamma(k)}{\gamma} - 1 \right)^{2} \right]^{1/2}$$

as a function of $k$ for four related sampling distributions which all satisfy the
second-order condition (2.7) with different values of the second-order parameters.

[Figure 1: RMSE curves against k, one per distribution: t1, t2, t4, t10.]

Fig 1. Estimated standardised RMSE as a function of $k$ for samples of size 10000 from
Student’s distributions with different degrees of freedom $\nu = 1, 2, 4, 10$. All four distributions
satisfy Condition (2.7) with $|\rho| = 2/\nu$. The increasing parts of the lines reflect the values of
$\rho$. RMSE is estimated by averaging over 5000 Monte-Carlo simulations.
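The bias-variance trade-off behind Figure 1 can be reproduced on a small scale. The sketch below is not the simulation used for the figure: it assumes absolute values of Cauchy draws (Student with 1 degree of freedom, so that the tail index is 1), a sample size of 1000 and 200 replications, all chosen here only to keep the run short. The function names are hypothetical.

```python
import math
import random

def hill(sorted_logs, k):
    """Hill estimate from the k+1 largest log-observations (sorted in decreasing order)."""
    return sum(sorted_logs[:k]) / k - sorted_logs[k]

def abs_cauchy(rng):
    """|X| for X Cauchy distributed; redraw in the (negligible) event of an exact zero."""
    while True:
        v = abs(math.tan(math.pi * (rng.random() - 0.5)))
        if v > 0.0:
            return v

def standardised_rmse(n, ks, gamma, draw, n_rep, rng):
    """Monte-Carlo estimate of E[(gamma_hat(k)/gamma - 1)^2]^(1/2) for each k in ks."""
    sq_err = {k: 0.0 for k in ks}
    for _ in range(n_rep):
        logs = sorted((math.log(draw(rng)) for _ in range(n)), reverse=True)
        for k in ks:
            sq_err[k] += (hill(logs, k) / gamma - 1.0) ** 2
    return {k: math.sqrt(s / n_rep) for k, s in sq_err.items()}

rng = random.Random(1)
rmse = standardised_rmse(n=1000, ks=[5, 100, 900], gamma=1.0,
                         draw=abs_cauchy, n_rep=200, rng=rng)
print(rmse)
```

With these settings the error at the intermediate index should be markedly smaller than at either extreme: a small k is dominated by variance, a large k by bias.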


Under this second-order condition (2.7), Hall and Welsh proved that the
asymptotic mean squared error of the Hill estimator is minimal for sequences
$(k_n^*)_n$ satisfying

$$k_n^* \sim K(C, D, \rho)\, n^{2|\rho|/(1+2|\rho|)}$$

with $K(C, D, \rho) = \left( C^{2|\rho|} (1+|\rho|)^2 / (2 D^2 |\rho|^3) \right)^{1/(1+2|\rho|)}$. Since $C > 0$, $D \ne 0$ and
the second-order parameter $\rho < 0$ are usually unknown, many authors have
been interested in the construction of data-driven selection procedures for $k_n$
under conditions such as (2.7). A great deal of ingenuity has been dedicated to
the estimation of the second-order parameters and to the use of such estimates
when estimating first order parameters.

As we do not want to assume a second-order condition such as Condition
(2.7), we resort to Lepski’s method which is a general attempt to balance bias
and variance.

Since its introduction [Lepski, 1991], this general method for model selection
has been proved to achieve adaptivity and to provide one with oracle inequalities
in a variety of inferential contexts ranging from density estimation to inverse
problems and classification [Lepski, 1990, 1991, 1992, Lepski and Tsybakov,
2000]. Very readable introductions to Lepski’s method and its connections with
penalised contrast methods can be found in [Birgé, 2001, Mathé, 2006]. In EVT,
we are aware of three papers that explicitly rely on this methodology: [Drees
and Kaufmann, 1998], [Grama and Spokoiny, 2008] and [Carpentier and Kim,
2015].

The selection rule analysed in the present paper (see Section 3.2 for a precise
definition) is a variant of the preliminary selection rule introduced in [Drees and
Kaufmann, 1998]

$$\kappa_n(r_n) = \min\left\{ k \in \{2, \ldots, n\} : \max_{2 \le i \le k} \sqrt{i}\, |\hat\gamma(i) - \hat\gamma(k)| > r_n \right\} \tag{2.8}$$

where $(r_n)_n$ is a sequence of thresholds such that $\sqrt{\ln\ln n} = o(r_n)$ and $r_n = o(\sqrt{n})$,
and $\hat\gamma(i)$ is the Hill estimator computed from the $(i + 1)$ largest order
statistics. The definition of this “stopping time” is motivated by Lemma 1 from
[Drees and Kaufmann, 1998] which asserts that, under the von Mises condition,

$$\max_{2 \le i \le k_n} \sqrt{i}\, |\hat\gamma(i) - \mathbb{E}[\hat\gamma(i)]| = O_P\left(\sqrt{\ln\ln n}\right).$$

In words, this selection rule almost picks out the largest index $k$ such that, for
all $i$ smaller than $k$, $\hat\gamma(k)$ differs from $\hat\gamma(i)$ by a quantity that is not much larger
than the typical fluctuations of $\hat\gamma(i)$. This index selection rule can be performed
graphically by interpreting an alternative Hill plot as shown on Figure 2 [see
Drees et al., 2000, Resnick, 2007, for a discussion on the merits of alt-Hill plots].
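The preliminary selection rule lends itself to a direct, if naive, implementation: scan the indices in increasing order and stop at the first k for which some earlier Hill estimate differs from the k-th one by more than the threshold divided by the square root of its index. The sketch below is a quadratic-time illustration under stated assumptions: the function names are hypothetical, the threshold value used in the simulated example is arbitrary, and the rule implemented is the stopping rule as displayed in (2.8), not the variant analysed in Section 3.2.

```python
import math
import random

def hill_estimates(sample):
    """Return [gamma_hat(1), ..., gamma_hat(n-1)] from strictly positive observations."""
    logs = sorted((math.log(v) for v in sample), reverse=True)
    estimates, running = [], 0.0
    for k in range(1, len(logs)):
        running += logs[k - 1]                     # sum of the k largest logs
        estimates.append(running / k - logs[k])
    return estimates

def preliminary_index(estimates, r_n):
    """Smallest k >= 2 with max_{2<=i<=k} sqrt(i) * |gamma_hat(i) - gamma_hat(k)| > r_n.

    estimates[k - 1] holds gamma_hat(k). Returns None if the threshold is never crossed.
    """
    for k in range(2, len(estimates) + 1):
        g_k = estimates[k - 1]
        worst = max(math.sqrt(i) * abs(estimates[i - 1] - g_k) for i in range(2, k + 1))
        if worst > r_n:
            return k
    return None

# Deterministic toy check: the fifth estimate jumps away from the first four.
toy = [1.0, 1.0, 1.0, 1.0, 3.0]
print(preliminary_index(toy, r_n=2.0))     # sqrt(4) * |1 - 3| = 4 exceeds 2, so k = 5
print(preliminary_index(toy, r_n=10.0))    # threshold never crossed

# Simulated usage on absolute values of Cauchy draws; r_n below is an arbitrary choice.
rng = random.Random(3)
draws = (abs(math.tan(math.pi * (rng.random() - 0.5))) for _ in range(500))
sample = [v for v in draws if v > 0.0]
n = len(sample)
r_n = 3.0 * math.sqrt(math.log(math.log(n)))
selected = preliminary_index(hill_estimates(sample), r_n)
```

Recomputing the maximum from scratch for every k keeps the code close to the displayed formula; a production version would exploit the stopping structure to avoid the quadratic cost.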

The goal of Drees and Kaufmann [1998] is not to investigate the performance
of the preliminary selection rule defined in Display (2.8) but to design a selection
rule $\hat\kappa_n(r_n)$, based on $\kappa_n(r_n)$, that would asymptotically mimic the optimal
selection rule $k_n^*$ under second-order conditions.

Our goal, as in [Carpentier and Kim, 2015, Grama and Spokoiny, 2008], is to
derive non-asymptotic risk bounds without making a second-order assumption.
In both papers, the rationale for working with some special collection of
estimators seems to be the ability to derive non-asymptotic deviation inequalities for
$\hat\gamma(k)$ either from exponential inequalities for log-likelihood ratio statistics or from
simple binomial tail inequalities such as Bernstein’s inequality [see Boucheron
et al., 2013, Section 2.8].

In models satisfying Condition (2.7), the estimators from [Grama and Spokoiny,
2008] achieve the optimal rate up to a $\ln(n)$ factor. Carpentier and Kim [2015]
prove that the risk of their data-driven estimator decays at the optimal rate
up to a $(\ln\ln n)^{|\rho|/(1+2|\rho|)}$ factor in models satisfying Condition (2.6).

Fig 2. Lepski’s method illustrated on an alt-Hill plot. The plain line describes the sequence
of Hill estimates as a function of index $k$ computed on a pseudo-random sample of size
$n = 10000$ from Student distribution with 1 degree of freedom (Cauchy distribution). Hill
estimators are computed from the positive order statistics. The grey ribbon around the plain
line provides a graphic illustration of Lepski’s method. For a given value of $i$, the width of the
ribbon is $2 r_n \hat\gamma(i)/\sqrt{i}$. A point $(k, \hat\gamma(k))$ on the plain line corresponds to an eligible index if
the horizontal segment between this point and the vertical axis lies inside the ribbon, that is, if
for all $i$, $30 \le i < k$, $|\hat\gamma(k) - \hat\gamma(i)| \le r_n \hat\gamma(i)/\sqrt{i}$. If $r_n$ were replaced by an appropriate quantile
of the Gaussian distribution, the grey ribbon would just represent the confidence tube that is
usually added on Hill plots. The triangle represents the selected index with $r_n = \sqrt{2.1 \ln\ln n}$.
The cross represents the oracle index estimated from Monte-Carlo simulations, see Table 2.

We aim at achieving optimal risk bounds under Condition (2.6) using a simple
estimation method requiring almost no calibration effort and based on
mainstream extreme value index estimators. Before describing the keystone of our
approach in Section 2.5, we recall the recent lower risk bound for adaptive
extreme value index estimation.


2.4. Lower bound

One of the key results in [Carpentier and Kim, 2015] is a lower bound on the
accuracy of adaptive tail index estimation. This lower bound reveals that, just
as for estimating a density at a point [Lepski, 1991, 1992], or point estimation
in Sobolev spaces [Tsybakov, 1998], as far as tail index estimation is concerned,
adaptivity has a price. Using Fano’s Lemma, and a Bayesian game that
extends cleanly in the frameworks of [Grama and Spokoiny, 2008] and [Novak, 2014],
Carpentier and Kim were able to prove the next minimax lower bound.

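Theorem 2.9. Let $\rho_0 < -1$ and $v \in [0, e/(1+2e)]$. Then, for any tail index
estimator $\hat\gamma$ and any sample size $n$ such that $M = \lfloor \ln n \rfloor > e/v$, there exists a
probability distribution $P$ such that

i) $P \in \mathrm{MDA}(\gamma)$ with $\gamma > 0$,

ii) $P$ meets the von Mises condition with von Mises function $\eta$ satisfying

$$\overline{\eta}(t) \le \gamma t^{\rho}$$

for some $\rho \in [\rho_0, 0)$,

iii)

$$P\left\{ |\hat\gamma - \gamma| \ge \frac{C_\rho}{4}\, \gamma \left( \frac{v \ln\ln n}{n} \right)^{|\rho|/(1+2|\rho|)} \right\} \ge \frac{1}{1+2e}$$

and

$$\mathbb{E}_P\left[ \frac{|\hat\gamma - \gamma|}{\gamma} \right] \ge \frac{C_\rho}{4(1+2e)} \left( \frac{v \ln\ln n}{n} \right)^{|\rho|/(1+2|\rho|)},$$

with $C_\rho = 1 - \exp\left( -\frac{1}{2(1+2|\rho|)^2} \right)$.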

Using Birgé’s Lemma instead of Fano’s Lemma, we provide a simpler, shorter
proof of this theorem (see Appendix E).

The lower rate of convergence provided by Theorem 2.9 is another incentive
to revisit the preliminary tail index estimator from [Drees and Kaufmann, 1998].
However, instead of using a sequence $(r_n)_n$ of order larger than $\sqrt{\ln\ln n}$ in order
to calibrate pairwise tests and ultimately to design estimators of the
second-order parameter (if there are any), it is worth investigating a minimal sequence
where $r_n$ is of order $\sqrt{\ln\ln n}$, and checking whether the corresponding adaptive
estimator achieves the Carpentier-Kim lower bound (Theorem 2.9).

In this paper, we focus on $r_n$ of the order $\sqrt{\ln\ln n}$. The rationale for imposing
$r_n$ of the order $\sqrt{\ln\ln n}$ can be understood by the fact that, even if the sampling
distribution is a pure Pareto distribution with shape parameter $\gamma$ ($\overline{F}(x) = (x/\tau)^{-1/\gamma}$
for $x \ge \tau > 0$), if

$$\limsup_n \frac{r_n}{\gamma \sqrt{2 \ln\ln n}} < 1,$$

the preliminary selection rule will, with high probability, select a small value of
$k$ and thus pick out a suboptimal estimator. This can be justified using results
from [Darling and Erdős, 1956] (see Appendix A for details).

Such an endeavour requires sharp probabilistic tools. They are the topic of
the next section.

2.5. Talagrand’s concentration phenomenon for products of
exponential distributions


Deriving authentic concentration inequalities for Hill estimators is not
straightforward. Fortunately, the construction of such inequalities turns out to be
possible thanks to general functional inequalities that hold for functions of
independent exponentially distributed random variables. We recall these inequalities
(Proposition 2.10 and Theorem 2.15) which have been largely overlooked in
statistics. A thorough and readable presentation of these inequalities can be
found in [Ledoux, 2001]. We start with the easiest result, a variance bound that
pertains to the family of Poincaré inequalities.

Proposition 2.10 (Poincaré inequality for exponentials, [Bobkov and Ledoux,
1997]). If $g$ is a differentiable function over $\mathbb{R}^n$ and $Z = g(E_1, \ldots, E_n)$ where
$E_1, \ldots, E_n$ are independent standard exponential random variables, then

$$\operatorname{Var}(Z) \le 4\, \mathbb{E}\left[ \|\nabla g\|^2 \right].$$

Remark 2.11. The constant 4 cannot be improved.
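A simple worked example may help to see how the Poincaré inequality is used. Take a rescaled sum of $k$ independent standard exponential variables, $Z = (\gamma/k) \sum_{i \le k} E_i$; this particular choice of $g$ is an assumption made here for illustration, and corresponds to the Hill estimator when the von Mises function vanishes. The gradient is constant, so both sides of the inequality can be computed exactly:

```latex
g(e_1,\dots,e_k) = \frac{\gamma}{k}\sum_{i=1}^{k} e_i,
\qquad
\|\nabla g\|^2 = \sum_{i=1}^{k} \frac{\gamma^2}{k^2} = \frac{\gamma^2}{k},
\qquad
\operatorname{Var}(Z) = \frac{\gamma^2}{k}
\;\le\; 4\,\mathbb{E}\,\|\nabla g\|^2 = \frac{4\gamma^2}{k}.
```

For this linear functional the Poincaré bound recovers the correct order $\gamma^2/k$ of the variance, up to the factor 4.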

The next corollary is stated in order to point out the relevance of this Poincaré
inequality to the analysis of general order statistics and their functionals. Recall
that the hazard rate of an absolutely continuous probability distribution with
distribution function $F$ is $h = f/\overline{F}$, where $f$ and $\overline{F} = 1 - F$ are the density and the
survival function associated with $F$, respectively.

Corollary 2.12. Assume the distribution of $X$ has a positive density, then the
$k$th order statistic $X_{(k)}$ satisfies

$$\operatorname{Var}(X_{(k)}) \le C \sum_{i=k}^{n} \frac{1}{i^2}\, \mathbb{E}\left[ \frac{1}{h(X_{(k)})^2} \right],$$

where $C$ can be chosen as 4.

Remark 2.13. By Smirnov’s Lemma [de Haan and Ferreira, 2006], $C$ cannot be
smaller than 1. If the distribution of $X$ has a non-decreasing hazard rate, the
factor of 4 can be improved into a factor 2 [Boucheron and Thomas, 2012].

Bobkov and Ledoux [1997], Maurey [1991], Talagrand [1991] show that smooth
functions of independent exponential random variables satisfy Bernstein type
concentration inequalities. The next result is extracted from the derivation of
Talagrand’s concentration phenomenon for products of exponential random
variables in [Bobkov and Ledoux, 1997].

The definition of sub-gamma random variables will be used in the formulation
of the theorem and in many arguments.

Definition 2.14. A real-valued centred random variable $X$ is said to be
sub-gamma on the right tail with variance factor $v$ and scale parameter $c$ if

$$\ln \mathbb{E} e^{\lambda X} \le \frac{\lambda^2 v}{2(1 - c\lambda)} \quad \text{for every } \lambda \text{ such that } 0 < \lambda < 1/c.$$

We denote the collection of such random variables by $\Gamma_+(v, c)$. Similarly, $X$ is
said to be sub-gamma on the left tail with variance factor $v$ and scale parameter
$c$ if $-X$ is sub-gamma on the right tail with variance factor $v$ and scale parameter
$c$. We denote the collection of such random variables by $\Gamma_-(v, c)$ and
$\Gamma_+(v, c) \cap \Gamma_-(v, c)$ by $\Gamma_\pm(v, c)$.

If $X - \mathbb{E}X \in \Gamma_+(v, c)$, then for all $\delta \in (0, 1)$, with probability larger than
$1 - \delta$,

$$X \le \mathbb{E}X + \sqrt{2v \ln(1/\delta)} + c \ln(1/\delta).$$
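The deviation bound for sub-gamma random variables is easy to evaluate numerically. The helper below is a minimal sketch; the function name `sub_gamma_deviation` is hypothetical, and it simply computes the quantity that a variable with variance factor v and scale parameter c exceeds (above its mean) with probability at most delta.

```python
import math

def sub_gamma_deviation(v, c, delta):
    """Return sqrt(2 v ln(1/delta)) + c ln(1/delta), the upper deviation level
    exceeded with probability at most delta by a sub-gamma variable (v, c)."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if v < 0.0 or c < 0.0:
        raise ValueError("variance factor and scale parameter must be non-negative")
    t = math.log(1.0 / delta)
    return math.sqrt(2.0 * v * t) + c * t

# With v = 1, c = 0 and delta = exp(-2) the bound is sqrt(2 * 2) = 2 (Gaussian-like regime).
gaussian_like = sub_gamma_deviation(1.0, 0.0, math.exp(-2.0))
# A positive scale parameter adds the linear term c * ln(1/delta) = 2.
with_scale = sub_gamma_deviation(1.0, 1.0, math.exp(-2.0))
```

The first term dominates for moderate confidence levels, the second for very small delta, which is the usual Bernstein-type behaviour.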

The entropy of a non-negative random variable $X$ is defined by
$\operatorname{Ent}[X] = \mathbb{E}[X \ln X] - \mathbb{E}X \ln \mathbb{E}X$.


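Theorem 2.15. Assume that $g$ is a differentiable function on $\mathbb{R}^n$ with $\max_i |\partial_i g| < \infty$.
Let $Z = g(E_1, \ldots, E_n)$ where $E_1, \ldots, E_n$ are $n$ independent standard
exponential random variables and $c < 1$. Then, for all $\lambda$ such that $0 \le \lambda \max_i |\partial_i g| \le c$,

$$\operatorname{Ent}\left[ e^{\lambda(Z - \mathbb{E}Z)} \right] \le \frac{2\lambda^2}{1 - c}\, \mathbb{E}\left[ e^{\lambda(Z - \mathbb{E}Z)} \|\nabla g\|^2 \right].$$

Let $v$ be the essential supremum of $\|\nabla g\|^2$, then $Z$ is sub-gamma on both tails
with variance factor $4v$ and scale factor $\max_i |\partial_i g|$.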


Again, we illustrate the relevance of these versatile tools on the analysis of
general order statistics. This general theorem implies that if the sampling
distribution has non-decreasing hazard rate, then the order statistics satisfy
Bernstein type inequalities [see Boucheron et al., 2013, Section 2.8] with variance
factor $(4/k)\, \mathbb{E}[1/h(X_{(k)})^2]$ (the Poincaré estimate of variance) and scale
parameter $(\sup_x 1/h(x))/k$. Starting back from the Efron-Stein-Steele inequality, the
authors derived a somewhat sharper inequality [Boucheron and Thomas, 2012].

Corollary 2.16. Assume the distribution function $F$ has non-decreasing
hazard rate $h$, that is, $U \circ \exp$ is concave. Let $Z = g(E_1, \ldots, E_n) =
(U \circ \exp)\left( \sum_{i=k}^{n} E_i/i \right)$ be distributed as the $k$th order statistic of a sample
distributed according to $F$. Then, $Z$ is sub-gamma on both tails with variance factor
$(4/k)(1 + 1/k)\, \mathbb{E}[1/h(Z)^2]$ and scale factor $1/(k \inf_x h(x))$.


This corollary describes in which way central, intermediate and extreme
order statistics can be portrayed as smooth functions of independent exponential
random variables. This possibility should not be taken for granted as it is
non-trivial to capture in a non-asymptotic way the tail behaviour of maxima of
independent Gaussians [Boucheron and Thomas, 2012, Chatterjee, 2014, Ledoux,
2001]. In the next section, we show in which way the Hill estimator can fit into
this picture.


3. Main results

In this section, the sampling distribution $F$ is assumed to belong to $\mathrm{MDA}(\gamma)$
with $\gamma > 0$ and to satisfy the von Mises condition (Definition 2.1) with bounded
von Mises function $\eta$.

3.1. Variance and concentration inequalities for the Hill estimators

It is well known that, under the von Mises condition, if $(k_n)$ is an intermediate
sequence, the sequence $\sqrt{k_n}(\hat\gamma(k_n) - \mathbb{E}\hat\gamma(k_n))$ converges in distribution towards
$\mathcal{N}(0, \gamma^2)$, suggesting that the variance of $\hat\gamma(k_n)$ scales like $\gamma^2/k_n$ [see Beirlant
et al., 2004, de Haan and Ferreira, 2006, Geluk et al., 1997, Resnick, 2007].

Proposition 3.1 provides us with handy non-asymptotic bounds on
$\operatorname{Var}[\hat\gamma(k)] - \gamma^2/k$ using the von Mises function.

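Proposition 3.1. Let $\hat\gamma(k)$ be the Hill estimator computed from the $(k + 1)$
largest order statistics of an $n$-sample from $F$. Then,

$$-\frac{2\gamma}{k}\, \mathbb{E}\left[ \overline{\eta}\left(e^{Y_{(k+1)}}\right) \right] \le \operatorname{Var}[\hat\gamma(k)] - \frac{\gamma^2}{k} \le \frac{2\gamma}{k}\, \mathbb{E}\left[ \overline{\eta}\left(e^{Y_{(k+1)}}\right) \right] + \frac{5}{k}\, \mathbb{E}\left[ \overline{\eta}\left(e^{Y_{(k+1)}}\right)^2 \right].$$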

The next Abelian result might help in appreciating these variance bounds.

Proposition 3.2. Assuming that $\eta$ is $\rho$-regularly varying with $\rho < 0$, then, for
any intermediate sequence $(k_n)$,

$$\lim_{n \to \infty} \frac{k_n \operatorname{Var}(\hat\gamma(k_n)) - \gamma^2}{\eta(n/k_n)} = \frac{2\gamma}{(1-\rho)^2}.$$

We may now move to genuine concentration inequalities for the Hill estimator.

The exponential representation (2.3) suggests that the rescaled Hill
estimator $k\hat\gamma(k)$ should be approximately distributed according to a gamma$(k, \gamma)$
distribution where $k$ is the shape parameter and $\gamma$ the scale parameter.
Therefore, we expect the Hill estimators to satisfy Bernstein type concentration
inequalities, that is, to be sub-gamma on both tails with variance factors
connected to the tail index $\gamma$ and to the von Mises function. Representation (2.3)
actually suggests more. Following [Drees and Kaufmann, 1998], we actually
expect the sequence $(\sqrt{k}(\hat\gamma(k) - \mathbb{E}\hat\gamma(k)))_k$ to behave like normalized partial
sums of independent square integrable random variables, that is, we believe
$\max_{2 \le k \le n} \sqrt{k}(\hat\gamma(k) - \mathbb{E}\hat\gamma(k))$ to scale like $\sqrt{\ln\ln n}$ and to be sub-gamma on both
tails (see Appendix A). The purpose of this section is to meet these expectations
in a non-asymptotic way.

Proofs use the Markov property of order statistics: conditionally on the
$(J+1)$th order statistic, the $J$ largest order statistics are distributed as the
order statistics of a sample of size $J$ of the excess distribution. They consist of
appropriate invocations of Talagrand’s concentration inequality (Theorem 2.15).
However, this theorem generally requires a uniform bound on the gradient of the
relevant function. When Hill estimators are analysed as functions of independent
exponential random variables, the partial derivatives depend on the points at
which the von Mises function is evaluated. In order to get interesting bounds,
it is worth conditioning on an intermediate order statistic.

Throughout this subsection, let $\ell$ be an integer larger than $\sqrt{\ln n}$
and $J$ an integer not larger than $n$. We denote by $E_i$, $1 \le i \le n$,
$n$ independent standard exponential random variables and we work on the
probability space where all $E_i$ are defined, and therefore consider the Hill
estimators defined by Representation (2.3). As we use the exponential
representation of order statistics, besides Hill estimators, the random
variables that appear in the main statements are order statistics of
exponential samples. As before, $Y_{(k)}$ will denote the $k$th order
statistic of a standard exponential sample of size $n$ (we agree on
$Y_{(n+1)} = 0$).
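
To make the exponential representation concrete, here is a minimal Python sketch (an illustration only, not part of the original paper; the function name is hypothetical). It builds the decreasing order statistics of a standard exponential sample from independent standard exponential variables through Rényi's representation, under the convention that $Y_{(1)}$ is the largest order statistic and $Y_{(n+1)} = 0$.

```python
import random


def renyi_order_statistics(n, rng):
    """Return [Y_(1), ..., Y_(n), Y_(n+1)] with Y_(1) >= ... >= Y_(n) and Y_(n+1) = 0.

    Renyi's representation: the spacings Y_(i) - Y_(i+1) are independent and
    distributed as E_i / i, where the E_i are standard exponential variables.
    """
    spacings = [rng.expovariate(1.0) / i for i in range(1, n + 1)]  # E_i / i
    y = [0.0] * (n + 2)        # y[k] holds Y_(k); y[n + 1] = 0 by convention
    for k in range(n, 0, -1):  # accumulate upwards from the smallest order statistic
        y[k] = y[k + 1] + spacings[k - 1]
    return y[1:]


rng = random.Random(0)
y = renyi_order_statistics(1000, rng)
```

With this construction, $Y_{(k+1)} = \sum_{i=k+1}^{n} E_i/i$, which is the form used when conditioning on an intermediate order statistic.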

The first theorem provides an exponential refinement of the variance bound
stated in Proposition 3.1. However, as announced, there is a price to pay:
statements hold conditionally on some order statistic. This is not an
impediment to analysing Lepski's rule using this theorem. Indeed, when
analysing Lepski's rule, it is sufficient to control the Hill process
$(\sqrt{i}(\widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid Y_{(k+1)}]))_i$
for indices $i$ ranging between $\ell_n$ (that should not be smaller than
$\ln(n)$) and some upper bound $k_n$ that achieves a certain balance between
bias and standard deviation (the bias of $\widehat{\gamma}(k_n)$ should be of
order $r_n$ times the standard deviation, that is approximately
$\gamma r_n/\sqrt{k_n}$ where $r_n \approx \sqrt{\ln(\ln(n))}$). The second
clause of the next theorem is the cornerstone in the derivation of the risk
bounds presented in the next section.

In the sequel, let

$$\xi_n = c_1 \sqrt{\ln \log_2 n} + c_1' ,$$

where $c_1$ may be chosen not larger than 4 and $c_1'$ not larger than 34.

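Theorem 3.3. Let $T$ be a shorthand for $\exp(Y_{(J+1)})$. For some $k$ such
that $c_2 \ln n \vee 32 \le \ell \le k \le J$ where $c_2 \ge 2$, let

$$Z = \max_{\ell \le i \le k} \sqrt{i}\, \bigl| \widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid Y_{(k+1)}] \bigr| .$$

Then, conditionally on $T$,

i) For $\ell \le i \le k$,

$$\sqrt{i}\, \bigl(\widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid T]\bigr) \in \Gamma_{\pm}\Bigl( 4(\gamma + 3\overline{\eta}(T))^2 ,\ (\gamma + 2\overline{\eta}(T))/\sqrt{i} \Bigr) .$$

ii) Let $u$ be such that $J\,\overline{\eta}(u)^2 \le \gamma^2 r_n^2$ where
$r_n = \sqrt{c_3 \ln\ln n}$ with $c_3 = 2$. Assume that $T \ge u$, then

$$Z \in \Gamma_{\pm}\Bigl( 4\gamma^2 (1 + 3r_n/\sqrt{J})^2 ,\ \gamma(1 + 2r_n/\sqrt{J})/\sqrt{\ell} \Bigr)$$

and

$$\mathbb{E}[Z \mid T] \le \xi_n\, \gamma\, (1 + 3r_n/\sqrt{J}) .$$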

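Remark 3.4. If $F$ is a pure Pareto distribution with shape parameter
$\gamma > 0$, then $k\widehat{\gamma}(k)/\gamma$ is distributed according to a
gamma distribution with shape parameter $k$ and scale parameter 1. Tight and
well-known tail bounds for gamma distributed random variables assert that

$$\mathbb{P}\left\{ \bigl| \widehat{\gamma}(k) - \mathbb{E}[\widehat{\gamma}(k)] \bigr| \ge \frac{\gamma}{\sqrt{k}} \left( \sqrt{2\ln(2/\delta)} + \frac{\ln(2/\delta)}{\sqrt{k}} \right) \right\} \le \delta .$$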
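
The gamma tail bound for pure Pareto samples can be checked numerically. The following Python sketch is an illustration under stated assumptions, not part of the original paper: the function names, sample size, number of order statistics and number of replications are all hypothetical choices. It simulates Pareto samples with tail index $\gamma$, computes the Hill estimator on the $k$ largest observations, and records how often the deviation from $\gamma$ (which is the expectation of the Hill estimator in the pure Pareto case) exceeds the level $(\gamma/\sqrt{k})(\sqrt{2\ln(2/\delta)} + \ln(2/\delta)/\sqrt{k})$.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimator built on the k largest observations of a positive sample."""
    xs = sorted(sample, reverse=True)
    return sum(math.log(xs[i] / xs[k]) for i in range(k)) / k


def gamma_deviation_bound(gamma, k, delta):
    """Level (gamma / sqrt(k)) * (sqrt(2 ln(2/delta)) + ln(2/delta) / sqrt(k))."""
    t = math.log(2.0 / delta)
    return gamma / math.sqrt(k) * (math.sqrt(2.0 * t) + t / math.sqrt(k))


def exceedance_frequency(gamma, n, k, delta, replications, seed=0):
    """Fraction of replications where |Hill - gamma| exceeds the deviation level."""
    rng = random.Random(seed)
    bound = gamma_deviation_bound(gamma, k, delta)
    hits = 0
    for _ in range(replications):
        # Pure Pareto sample: X = U^(-gamma) with U uniform on (0, 1].
        sample = [(1.0 - rng.random()) ** (-gamma) for _ in range(n)]
        if abs(hill_estimator(sample, k) - gamma) > bound:
            hits += 1
    return hits / replications


freq = exceedance_frequency(gamma=0.5, n=500, k=50, delta=0.1, replications=400)
```

The observed frequency should stay below $\delta$, typically by a wide margin since the bound is not tight for moderate $\delta$.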
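Remark 3.5. The first part of Statement ii) reads as: conditionally on
$\overline{\eta}(T) \le \gamma r_n/\sqrt{J}$, with probability larger than
$1 - \delta$,

$$\bigl| Z - \mathbb{E}[Z \mid T] \bigr| \le \gamma (1 + 3r_n/\sqrt{J}) \left( \sqrt{8\ln(2/\delta)} + \frac{\ln(2/\delta)}{\sqrt{\ell}} \right) .$$

Combining both parts of Statement ii), we also get that, conditionally on
$\overline{\eta}(T) \le \gamma r_n/\sqrt{J}$, with probability larger than
$1 - \delta$,

$$Z \le \gamma (1 + 3r_n/\sqrt{J}) \left( \xi_n + \sqrt{8\ln(2/\delta)} + \frac{\ln(2/\delta)}{\sqrt{\ell}} \right) .$$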

Remark 3.6. The reader may wonder whether resorting to the exponential
representation and usual Chernoff bounding would not provide a simpler argument.
The straightforward approach leads to the following conditional bound on the
logarithmic moment generating function.


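$$\ln \mathbb{E}\Bigl[ \exp\bigl( \lambda (\widehat{\gamma}(k) - \mathbb{E}[\widehat{\gamma}(k) \mid Y_{(k+1)}]) \bigr) \Bigm| Y_{(k+1)} \Bigr] \le \frac{\lambda^2 \bigl(\gamma + \overline{\eta}(e^{Y_{(k+1)}})\bigr)^2}{2k \bigl(1 - \lambda (\gamma + \overline{\eta}(e^{Y_{(k+1)}}))\bigr)} + \lambda \bigl( \overline{\eta}(e^{Y_{(k+1)}}) - b(e^{Y_{(k+1)}}) \bigr) .$$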


A similar statement holds for the lower tail. This leads to exponential bounds
for deviations of the Hill estimator above
$\mathbb{E}[\widehat{\gamma}(k) \mid Y_{(k+1)}] + \overline{\eta}(e^{Y_{(k+1)}}) - b(e^{Y_{(k+1)}})$,
that is, to control deviations of the Hill estimator above its expectation plus
a term that may be of the order of magnitude of the bias.

Attempts to rewrite
$\widehat{\gamma}(k) - \mathbb{E}[\widehat{\gamma}(k) \mid Y_{(k+1)}]$ as a sum
of martingale increments
$\mathbb{E}[\widehat{\gamma}(k) \mid Y_{(i)}] - \mathbb{E}[\widehat{\gamma}(k) \mid Y_{(i+1)}]$,
for $1 \le i \le k$, and to exhibit an exponential supermartingale met the same
impediments.

At the expense of inflating the variance factor, Theorem 2.15 provides a
genuine (conditional) concentration inequality for Hill estimators. As we will
deal with values of $k$ for which bias exceeds the typical order of magnitude
of fluctuations, this is relevant to our purpose.


3.2. Adaptive Hill estimation 

We are now able to characterise the performance of the variant of the
selection rule defined by (2.8) [Drees and Kaufmann, 1998] with
$r_n = \sqrt{c_3 \ln\ln n}$ where $c_3 = 2$. Let
$\ell_n = \lceil c_2 \ln n \rceil$ where $c_2$ is a constant to be defined below.

The deterministic sequence of indices $(k_n(r_n))$ is defined (for $n$ large
enough) by

$$k_n(r_n) = \max\Bigl\{ k \in \{\ell_n, \ldots, n\} : \sqrt{k}\, \overline{\eta}(n/k^{\delta}) \le \gamma r_n \Bigr\} , \qquad (3.7)$$

where $k^{\delta} = k + \sqrt{2k\ln(1/\delta)} + 2\ln(1/\delta)$. The sequence
$(k_n(1))_n$ is defined by choosing $r_n = 1$. The deterministic sequences
$(k_n(1))$ and $(k_n(r_n))$ achieve specific balances between bias and
variance. In full generality, because $\overline{\eta}(t)$ is just an upper
bound on the conditional bias $b(t)$, it is difficult to precisely connect




$(k_n(1))$ and $(k_n(r_n))$ with the oracle sequence $(k_n^*)$. We call these
two sequences the pivotal sequences. In the sequel, $k_n$ stands for either of
the two pivotal indices; if the context is not clear, we specify $k_n(1)$ or
$k_n(r_n)$.

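Let $1/(2n) \le \delta \le 1/4$. Recall, from Section 3.1, that
$\xi_n = c_1\sqrt{\ln \log_2 n} + c_1'$ and agree on the shorthands

$$z_\delta = (1 + 3r_n/\sqrt{k_n}) \left( \xi_n + \sqrt{8\ln(2/\delta)} + \frac{\ln(2/\delta)}{\sqrt{\ell_n}} \right)$$

and $\overline{z}_\delta$ which is defined by replacing $k_n$ by $\ell_n$ in
the definition of $z_\delta$ ($\overline{z}_\delta$ depends on $n, \delta$ but
not on the sampling distribution). In the sequel, $c_2$ is assumed to be chosen
so that $r_n + \overline{z}_\delta \le 9\sqrt{c_2 \ln(n)}/10$ for $n \ge 1000$
and $2/n \le \delta \le 1/4$ ($c_2$ may be chosen not larger than 100).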

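The index $\widehat{k}_n$ is selected according to the following rule:

$$\widehat{k}_n = \max\left\{ k \in \{\ell_n, \ldots, n\} \ \text{and}\ \forall i \in \{\ell_n, \ldots, k\},\ \bigl| \widehat{\gamma}(i) - \widehat{\gamma}(k) \bigr| \le \frac{r_n(\delta)\, \widehat{\gamma}(i)}{\sqrt{i}} \right\}$$

where $r_n(\delta) = 10(r_n + \overline{z}_\delta)$. The quantity $r_n(\delta)$
scales like $\sqrt{\ln((2/\delta)\ln(n))}$. The tail index estimator is
$\widehat{\gamma}(\widehat{k}_n)$.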

As tail adaptivity has a price (see Theorem 2.9), the ratio between the risk
of the data-driven estimator $\widehat{\gamma}(\widehat{k}_n)$ and the risk of
the pivotal index $\widehat{\gamma}(k_n(1))$ cannot be upper bounded by a
constant factor, let alone by a factor close to 1. This is why in the next
theorem, we compare the risk of the empirically selected index
$\widehat{\gamma}(\widehat{k}_n)$ with the risk of the pivotal index
$\widehat{\gamma}(k_n)$.

Recall, from Section 3.1, that

$$\xi_n = c_1 \sqrt{\ln \log_2 n} + c_1' \quad \text{with } c_1 \le 4 \text{ and } c_1' \le 34 .$$



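Theorem 3.8. Assume the sampling distribution $F \in \mathrm{MDA}(\gamma)$,
$\gamma > 0$, satisfies the von Mises condition with bounded von Mises function
$\eta$, and let $\overline{\eta}(t) = \sup_{s \ge t} |\eta(s)|$.

Let $n \ge 1000$ be large enough so that $k_n$ (Definition 3.7) is well
defined. Then, for $2/n \le \delta \le 1/4$, with probability larger than
$1 - 3\delta$,

$$\bigl| \gamma - \widehat{\gamma}(\widehat{k}_n) \bigr| \le \bigl| \gamma - \widehat{\gamma}(k_n) \bigr| \left( 1 + \frac{r_n(\delta)}{\sqrt{k_n}} \right) + \frac{r_n(\delta)\, \gamma}{\sqrt{k_n}}$$

and, with probability larger than $1 - 4\delta$,

$$\bigl| \widehat{\gamma}(\widehat{k}_n) - \gamma \bigr| \le \frac{2 r_n(\delta)}{\sqrt{k_n}}\, \gamma\, (1 + \alpha(\delta, n)) \qquad (3.9)$$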


where 


a{5,n) < 



\/RW A , 3r„(5)y 

rn{6) \ ^ Viu ) 


Remark 3.10. For $0 < \delta < 1/2$,

$$\alpha(\delta, n) = o(1) \quad \text{as } n \to \infty .$$

Remark 3.11. If the bias $b$ is $\rho$-regularly varying (or equivalently, if
the von Mises function $\eta$ or even $\overline{\eta}$ are regularly
varying), then, elaborating on Proposition 1 from [Drees and Kaufmann, 1998],
sequences $(k_n^*)$ and $(k_n(1))$ are connected by

$$\lim_n \frac{k_n(1)}{k_n^*} = (2|\rho|)^{1/(1+2|\rho|)}$$

and their quadratic risks are related by


E[(7-7(fcn(l)))^] 
n E[(7-7(A:*))2] 


2 |p| + l^ 


Moreover, under the second-order assumption, the two pivotal sequences
$(k_n(1))$ and $(k_n(r_n))$ are also connected.

Thus, if the bias is $\rho$-regularly varying, Theorem 3.8 provides us with a
connection between the performance of the simple selection rule and the
performance of the (asymptotically) optimal choice.

Recall that one of the main aims of this paper is to derive performance
guarantees for the data-driven index selection method $\widehat{k}_n$ without
resorting to second-order assumptions, that is, without assuming that the von
Mises function is regularly varying. The next corollary upper bounds the risk
of the preliminary estimator when we just have an upper bound on the bias.

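Corollary 3.12. Assume that, for some $C > 0$ and $\rho < 0$, for all $t > 1$,

$$\overline{\eta}(t) \le C t^{\rho} .$$

Then, there exists a constant $\kappa_{C,\delta,\rho}$ depending on $C$,
$\delta$ and $\rho$ such that, with probability larger than $1 - 4\delta$,

$$\bigl| \widehat{\gamma}(\widehat{k}_n) - \gamma \bigr| \le \kappa_{C,\delta,\rho} \left( \frac{\gamma^2 \ln((2/\delta)\ln n)}{n} \right)^{|\rho|/(1+2|\rho|)} (1 + \alpha(\delta, n))$$

where $\alpha(\delta, n)$ is defined in Theorem 3.8.

This meets the information-theoretic lower bound of Theorem 2.9.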


4. Proofs 

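4.1. Proof of Proposition 2.2

This proposition is a straightforward consequence of Rényi's representation of
order statistics of standard exponential samples.

As $F$ belongs to $\mathrm{MDA}(\gamma)$ and meets the von Mises condition,
there exists a function $\eta$ on $(1, \infty)$ with
$\lim_{x \to \infty} \eta(x) = 0$ such that

$$U(x) = c\, x^{\gamma} \exp\left( \int_1^x \frac{\eta(s)}{s}\, \mathrm{d}s \right)$$

and

$$U(e^{y}) = c \exp\left( \int_0^{y} (\gamma + \eta(e^{u}))\, \mathrm{d}u \right) .$$

Then, writing $E_i = i(Y_{(i)} - Y_{(i+1)})$ for the rescaled spacings, which
are independent standard exponential random variables by Rényi's
representation,

$$\begin{aligned}
\widehat{\gamma}(k) &\stackrel{d}{=} \frac{1}{k} \sum_{i=1}^{k} \ln \frac{U(e^{Y_{(i)}})}{U(e^{Y_{(k+1)}})} \\
&= \frac{1}{k} \sum_{i=1}^{k} \int_{Y_{(k+1)}}^{Y_{(i)}} (\gamma + \eta(e^{u}))\, \mathrm{d}u \\
&= \frac{1}{k} \sum_{i=1}^{k} i \int_{0}^{E_i/i} \bigl(\gamma + \eta(e^{u + Y_{(i+1)}})\bigr)\, \mathrm{d}u \\
&= \frac{1}{k} \sum_{i=1}^{k} \int_{0}^{E_i} \bigl(\gamma + \eta(e^{u/i + Y_{(i+1)}})\bigr)\, \mathrm{d}u .
\end{aligned}$$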




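4.2. Proof of Proposition 3.1

Let $Z = k\widehat{\gamma}(k)$. By the Pythagorean relation,

$$\operatorname{Var}(Z) = \mathbb{E}\bigl[\operatorname{Var}(Z \mid Y_{(k+1)})\bigr] + \operatorname{Var}\bigl(\mathbb{E}[Z \mid Y_{(k+1)}]\bigr) .$$

Representation (2.4) asserts that, conditionally on $Y_{(k+1)}$, $Z$ is
distributed as a sum of independent, exponentially distributed random
variables. Let $E$ be an exponentially distributed random variable.

$$\begin{aligned}
\operatorname{Var}(Z \mid Y_{(k+1)} = y) &= k \operatorname{Var}\left( \gamma E + \int_0^{E} \eta(e^{u+y})\, \mathrm{d}u \right) \\
&= k\gamma^2 + 2k\gamma \operatorname{Cov}\left( E, \int_0^{E} \eta(e^{u+y})\, \mathrm{d}u \right) + k \operatorname{Var}\left( \int_0^{E} \eta(e^{u+y})\, \mathrm{d}u \right) \\
&\le k\gamma^2 + 2k\gamma\, \overline{\eta}(e^{y}) + k\, (\overline{\eta}(e^{y}))^2 ,
\end{aligned}$$

where we have used the Cauchy-Schwarz inequality and
$\operatorname{Var}\bigl(\int_0^{E} \eta(e^{u+y})\, \mathrm{d}u\bigr) \le \overline{\eta}(e^{y})^2$.
Taking expectation with respect to $Y_{(k+1)}$ leads to

$$\mathbb{E}\bigl[\operatorname{Var}(Z \mid Y_{(k+1)})\bigr] \le k\gamma^2 + 2k\gamma\, \mathbb{E}\bigl[\overline{\eta}(e^{Y_{(k+1)}})\bigr] + k\, \mathbb{E}\bigl[\overline{\eta}(e^{Y_{(k+1)}})^2\bigr] .$$

The last term in the Pythagorean decomposition is also handled using
elementary arguments.

$$\mathbb{E}[Z \mid Y_{(k+1)}] = k\gamma + k \int_0^{\infty} e^{-u}\, \eta(e^{u + Y_{(k+1)}})\, \mathrm{d}u .$$

As $Y_{(k+1)}$ is a function of independent exponential random variables
($Y_{(k+1)} = \sum_{i=k+1}^{n} E_i/i$), the variance of
$\mathbb{E}[Z \mid Y_{(k+1)}]$ may be upper bounded using the Poincaré
inequality (Proposition 2.10):

$$\operatorname{Var}\bigl(\mathbb{E}[Z \mid Y_{(k+1)}]\bigr) \le 4k\, \mathbb{E}\bigl[\overline{\eta}(e^{Y_{(k+1)}})^2\bigr] .$$

In order to derive the lower bound, we first observe that

$$\operatorname{Var}(Z) \ge \mathbb{E}\bigl[\operatorname{Var}(Z \mid Y_{(k+1)})\bigr] .$$

Now, using the Cauchy-Schwarz inequality again,

$$\begin{aligned}
\operatorname{Var}(Z \mid Y_{(k+1)} = y) - k\gamma^2 &\ge 2k\gamma \operatorname{Cov}\left( E, \int_0^{E} \eta(e^{u+y})\, \mathrm{d}u \right) \\
&\ge -2k\gamma \operatorname{Var}\left( \int_0^{E} \eta(e^{u+y})\, \mathrm{d}u \right)^{1/2} \\
&\ge -2k\gamma\, \overline{\eta}(e^{y}) .
\end{aligned}$$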


4.3. Proof of Theorem 3.3


In the proof of Theorem 3.3, we will use the next maximal inequality [see
Boucheron et al., 2013, Corollary 2.6]. Recall the definition of
$\Gamma_+(v, c)$ (Definition 2.14).


Proposition 4.1. Let $Z_1, \ldots, Z_N$ be real-valued random variables
belonging to $\Gamma_+(v, c)$. Then

$$\mathbb{E}\left[ \max_{i=1,\ldots,N} Z_i \right] \le \sqrt{2v\ln N} + c\ln N .$$

Proofs follow a common pattern. In order to check that some random variable
is sub-gamma, we rely on its representation as a function of independent
exponential variables and compute partial derivatives, derive convenient upper 
bounds on the squared Euclidean norm and the supremum norm of the gradient 
and then invoke Theorem 2.15. 

At some point, we will use the next corollary of Theorem 2.15. 


Corollary 4.2. If $g$ is an almost everywhere differentiable function on
$\mathbb{R}$ with uniformly bounded derivative $g'$, then $g(Y_{(k+1)})$ is
sub-gamma with variance factor $4\|g'\|_\infty^2/k$ and scale factor
$\|g'\|_\infty/k$.

Proof of Theorem 3.3. We start from the exponential representation of Hill
estimators (Proposition 2.2) and represent all $\widehat{\gamma}(i)$ as
functions of independent random variables
$E_1, \ldots, E_k, \ldots, E_J, Y_{(J+1)}$ where the $E_j$, $1 \le j \le J$,
are standard exponentially distributed and $Y_{(J+1)}$ is distributed like the
$(J+1)$th largest order statistic of an $n$-sample of the standard exponential
distribution. We consistently use the notation
$Y_{(k)} = \sum_{j=k}^{J} E_j/j + Y_{(J+1)}$ for $1 \le k \le J$.

$$\begin{aligned}
i\,\widehat{\gamma}(i) &= \sum_{j=1}^{i} \int_0^{E_j} \Bigl( \gamma + \eta\bigl(e^{u/j + Y_{(j+1)}}\bigr) \Bigr)\, \mathrm{d}u \\
&= \sum_{j=1}^{i} \left( \gamma E_j + j \int_{Y_{(j+1)}}^{Y_{(j)}} \eta(e^{v})\, \mathrm{d}v \right) .
\end{aligned}$$

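Let $i'$ be such that $0 \le i' < i$, let us agree on $\widehat{\gamma}(0) = 0$. Let

$$\begin{aligned}
g(E_{i'+1}, \ldots, E_J) &= i\,\widehat{\gamma}(i) - i'\,\widehat{\gamma}(i') \\
&= \sum_{j=i'+1}^{i} \gamma E_j + \sum_{j=i'+1}^{i} j \int_{Y_{(j+1)}}^{Y_{(j)}} \eta(e^{v})\, \mathrm{d}v .
\end{aligned}$$

For $i' < p \le i$, as $\partial Y_{(j)}/\partial x_p = 1/p$ for $j \le p$ and
$0$ otherwise,

$$\begin{aligned}
\frac{\partial g}{\partial x_p} &= \gamma + \sum_{j=i'+1}^{i} j\, \frac{\partial}{\partial x_p} \int_{Y_{(j+1)}}^{Y_{(j)}} \eta(e^{v})\, \mathrm{d}v \\
&= \gamma + \sum_{j=i'+1}^{p-1} \frac{j}{p} \Bigl( \eta(e^{Y_{(j)}}) - \eta(e^{Y_{(j+1)}}) \Bigr) + \eta(e^{Y_{(p)}}) \\
&= \gamma + \frac{1}{p} \sum_{j=i'+2}^{p} \eta(e^{Y_{(j)}}) + \frac{i'+1}{p}\, \eta(e^{Y_{(i'+1)}}) .
\end{aligned}$$

This entails that, for $i' < p \le i$,

$$\left| \frac{\partial g}{\partial x_p} \right| \le \gamma + \overline{\eta}(e^{Y_{(p)}}) . \qquad (4.3)$$

For $i < p \le J$,

$$\begin{aligned}
\frac{\partial g}{\partial x_p} &= \sum_{j=i'+1}^{i} j\, \frac{\partial}{\partial x_p} \int_{Y_{(j+1)}}^{Y_{(j)}} \eta(e^{v})\, \mathrm{d}v \\
&= \frac{1}{p} \sum_{j=i'+1}^{i} j \Bigl( \eta(e^{Y_{(j)}}) - \eta(e^{Y_{(j+1)}}) \Bigr) \\
&= \frac{1}{p} \left( \sum_{j=i'+2}^{i} \eta(e^{Y_{(j)}}) + (i'+1)\, \eta(e^{Y_{(i'+1)}}) - i\, \eta(e^{Y_{(i+1)}}) \right) \\
&= \frac{1}{p} \left( \sum_{j=i'+1}^{i} \Bigl( \eta(e^{Y_{(j)}}) - \eta(e^{Y_{(i+1)}}) \Bigr) + i' \Bigl( \eta(e^{Y_{(i'+1)}}) - \eta(e^{Y_{(i+1)}}) \Bigr) \right) .
\end{aligned}$$

This is enough to entail that, for $i < p \le k$,

$$\left| \frac{\partial g}{\partial x_p} \right| \le \frac{2i}{p}\, \overline{\eta}(e^{Y_{(p)}}) .$$

All in all, for $1 \le p \le k$,

$$\left| \frac{\partial g}{\partial x_p} \right| \le (\gamma + \overline{\eta}(T)) \vee 2\overline{\eta}(T) \le \gamma + 2\overline{\eta}(T) . \qquad (4.4)$$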


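Proof of i) An upper bound on the variance factor for $i\,\widehat{\gamma}(i)$,
conditionally on $T$, is obtained by specialising to the case $i' = 0$ and
using (4.3) and (4.4) as well as the monotonicity of $\overline{\eta}$,

$$\sum_{p=1}^{J} \left| \frac{\partial g}{\partial x_p} \right|^2 \le \sum_{p=1}^{i} (\gamma + \overline{\eta}(T))^2 + \sum_{p=i+1}^{J} \frac{4i^2}{p^2} (\overline{\eta}(T))^2 \le i \Bigl( (\gamma + \overline{\eta}(T))^2 + 4 (\overline{\eta}(T))^2 \Bigr) \le i\, (\gamma + 3\overline{\eta}(T))^2 .$$

Using Theorem 2.15 conditionally on $T = \exp(Y_{(J+1)})$, we realise that
$\sqrt{i}(\widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid T])$ is
sub-gamma on both sides with variance factor not larger than
$4(\gamma + 3\overline{\eta}(T))^2$ and scale factor not larger than
$(\gamma + 2\overline{\eta}(T))/\sqrt{i}$. This yields, for $s > 0$,

$$\mathbb{P}\left\{ \sqrt{i}\, \bigl| \widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid T] \bigr| \ge (\gamma + 3\overline{\eta}(T)) \left( \sqrt{8s} + \frac{s}{\sqrt{i}} \right) \Bigm| T \right\} \le 2e^{-s} .$$

Taking expectation on both sides, this implies that

$$\mathbb{P}\left\{ \sqrt{i}\, \bigl| \widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid T] \bigr| \ge (\gamma + 3\overline{\eta}(T)) \left( \sqrt{8s} + \frac{s}{\sqrt{i}} \right) \right\} \le 2e^{-s} .$$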

Proof of ii) The proof of the upper bound on $\mathbb{E}[Z \mid T]$ in
Statement ii) from Theorem 3.3 relies on standard chaining techniques from the
theory of empirical processes and repeatedly uses the concentration Theorem
2.15 for smooth functions of independent exponential random variables and the
maximal inequality for sub-gamma random variables (Proposition 4.1).

For general i', the variance factor for 17 ( 1 ) — i'jii') is upper bounded by 

j2 

{i - i') {-f + f]{T)f + -^{2f]{T)f < + +i{2f]{T))'^ . 

p—i^l ^ 

Let $u$ be such that $J\,\overline{\eta}(u)^2 \le \gamma^2 r_n^2$ where
$r_n = \sqrt{c_3 \ln\ln n}$ with $c_3 = 2$. Now, as we assume, in the sequel,
that $T \ge u$, we may use the next upper bound for the variance factor of
$i\,\widehat{\gamma}(i) - i'\,\widehat{\gamma}(i')$ (conditionally on $Y_{(J+1)}$),

47 ^ (^(*-*')(l+;^) ■ 
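Recall that

$$Z = \max_{\ell \le i \le k} \sqrt{i}\, \bigl| \widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid Y_{(k+1)}] \bigr| .$$

As is commonplace in the analysis of normalised empirical processes [see
Giné and Koltchinskii, 2006, Massart, 2007, van de Geer, 2000, and references
therein], we peel the index set over which the maximum is computed.

Let $\mathcal{L}_n = \{\lfloor \log_2(\ell) \rfloor, \ldots, \lfloor \log_2(k) \rfloor\}$
and, for all $j \in \mathcal{L}_n$,
$\mathcal{S}_j = \{\ell \vee 2^j, \ldots, k \wedge (2^{j+1} - 1)\}$. Define $Z_j$ as

$$Z_j = \max_{i \in \mathcal{S}_j} \sqrt{i}\, \bigl| \widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid Y_{(k+1)}] \bigr| .$$

Then,

$$\begin{aligned}
\mathbb{E}[Z \mid Y_{(k+1)}] &= \mathbb{E}\Bigl[ \max_{j \in \mathcal{L}_n} Z_j \Bigm| Y_{(k+1)} \Bigr] \\
&\le \mathbb{E}\Bigl[ \max_{j \in \mathcal{L}_n} \bigl( Z_j - \mathbb{E}[Z_j \mid Y_{(k+1)}] \bigr) \Bigm| Y_{(k+1)} \Bigr] + \max_{j \in \mathcal{L}_n} \mathbb{E}[Z_j \mid Y_{(k+1)}] .
\end{aligned}$$

We now derive upper bounds on both summands by resorting to the maximal
inequality for sub-gamma random variables (Proposition 4.1). We first bound
$\mathbb{E}[Z_j \mid Y_{(k+1)}]$, for $j \in \mathcal{L}_n$.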

Note that direct invocation of Proposition 4.1 and Statement i) shows that

$$\mathbb{E}[Z_j \mid Y_{(k+1)}] \le 2\gamma (1 + 3r_n/\sqrt{J}) \left( \sqrt{8 j \ln(2)} + j\ln(2) \right) . \qquad (4.5)$$

This bound will be useful for handling small values of $j$. For $j \le 11$,
$\sqrt{8 j \ln(2)} + j \ln(2) \le 16$.
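
The numerical claim for small blocks is easy to verify. The short Python sketch below is an illustration only (the function name is hypothetical) and evaluates $\sqrt{8j\ln 2} + j\ln 2$ for $j \le 11$.

```python
import math


def small_block_factor(j):
    """sqrt(8 j ln 2) + j ln 2, the factor appearing in the bound for small j."""
    return math.sqrt(8.0 * j * math.log(2.0)) + j * math.log(2.0)


# Largest value of the factor over the range j = 0, ..., 11.
largest = max(small_block_factor(j) for j in range(12))
```

The factor is increasing in $j$; it remains below 16 up to $j = 11$ and exceeds 16 at $j = 12$, which is why larger blocks are handled by chaining.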

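We now handle generic $j$ using chaining. Fix $j \in \mathcal{L}_n$.

$$\max_{i \in \mathcal{S}_j} \sqrt{i}\, \bigl| \widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid Y_{(k+1)}] \bigr| \le 2^{-j/2} \max_{i \in \mathcal{S}_j} i\, \bigl| \widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid Y_{(k+1)}] \bigr| .$$

In order to alleviate notation, let
$W(i) = i\,(\widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid Y_{(k+1)}])$,
for $i \in \mathcal{S}_j$. For $i \in \mathcal{S}_j$, let

$$i = 2^j + \sum_{m=1}^{j} b_m 2^{j-m} \quad \text{where } b_m \in \{0, 1\}$$

be the binary expansion of $i$. Then, for $h \in \{0, \ldots, j\}$, let
$\pi_h(i)$ be defined by

$$\pi_h(i) = 2^j + \sum_{m=1}^{h} b_m 2^{j-m}$$

so that $\pi_j(i) = i$, $\pi_0(i) = 2^j$ and
$0 \le \pi_{h+1}(i) - \pi_h(i) \le 2^{j-h-1}$.

Using the fact that $W(\pi_0(i))$ does not depend on $i$ and that

$$\mathbb{E}[W(\pi_0(i)) \mid Y_{(k+1)}] = 0 ,$$

we obtain

$$\begin{aligned}
\mathbb{E}\Bigl[ \max_{i \in \mathcal{S}_j} i\, \bigl( \widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid Y_{(k+1)}] \bigr) \Bigm| Y_{(k+1)} \Bigr]
&= \mathbb{E}\Bigl[ \max_{i \in \mathcal{S}_j} W(i) \Bigm| Y_{(k+1)} \Bigr] \\
&= \mathbb{E}\Bigl[ \max_{i \in \mathcal{S}_j} \bigl( W(\pi_j(i)) - W(\pi_0(i)) \bigr) \Bigm| Y_{(k+1)} \Bigr] \\
&= \mathbb{E}\Bigl[ \max_{i \in \mathcal{S}_j} \sum_{h=0}^{j-1} \bigl( W(\pi_{h+1}(i)) - W(\pi_h(i)) \bigr) \Bigm| Y_{(k+1)} \Bigr] \\
&\le \sum_{h=0}^{j-1} \mathbb{E}\Bigl[ \max_{i \in \mathcal{S}_j} \bigl( W(\pi_{h+1}(i)) - W(\pi_h(i)) \bigr) \Bigm| Y_{(k+1)} \Bigr] .
\end{aligned}$$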
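
The chaining projections $\pi_h$ simply truncate the binary expansion of $i$. The Python sketch below is an illustration only: the function names are hypothetical, and it works on a full dyadic block $\{2^j, \ldots, 2^{j+1} - 1\}$, which is an assumption (the blocks in the proof may be truncated by $\ell$ and $k$). It implements $\pi_h$ with bit shifts and counts, for each level $h$, the number of distinct nonzero links $(\pi_h(i), \pi_{h+1}(i))$.

```python
def chain_projection(i, j, h):
    """Return pi_h(i) = 2^j + sum_{m <= h} b_m 2^(j - m) for 2^j <= i < 2^(j + 1)."""
    if not (2 ** j <= i < 2 ** (j + 1) and 0 <= h <= j):
        raise ValueError("need 2^j <= i < 2^(j+1) and 0 <= h <= j")
    shift = j - h  # number of trailing binary digits that are dropped
    return (i >> shift) << shift


def count_distinct_links(j, h):
    """Number of distinct nonzero increments (pi_h(i), pi_{h+1}(i)) over the block."""
    links = set()
    for i in range(2 ** j, 2 ** (j + 1)):
        lower, upper = chain_projection(i, j, h), chain_projection(i, j, h + 1)
        if lower != upper:
            links.add((lower, upper))
    return len(links)


j = 5
link_counts = [count_distinct_links(j, h) for h in range(j)]
```

On a full block, level $h$ contributes $2^h$ distinct links, which is the cardinality over which the maximal inequality is applied at each level of the chain.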


Now, for each h G {0,... — the maximum is taken over 2^ random variables 

which are sub-gamma with variance factor 

and scale factor ( 7 -|- 2 ? 7 (T)) < 7(1 -f 2r„/VJ)- By Proposition 4.1, since i G Sj, 
maxi ( 7 (i) - E [ 7 (i) | ^(fc+i)]) | ^(^+ 1 ) 

< 7 \/8/i2(l“^“i) j]^2 A- \/32/iln2r„ -I- -I- ^ 

2(i-i)/24,i5y§l^+ ?y32^1n(2)j2 + ZElll in 2 ") 

«j Z j 


< 7(1 + ^ 


where we have used 


4.15 > ^ V/i2-^ 


h^O 


r„ < as r„ = ^/ch^\nhi{^, j -f 1 > log2(c2 ln(n)). 

For j > 12, 


maxi (7(i) - E [7(i) | y(fc+i)]) | Y^^+i) 

iGSj 




Finally, for all j G Cn, 


E[^“| W)]<347 ( 1 +^) 
In order to prove Statement ii), we check that, for each
$j \in \mathcal{L}_n$, $Z_j$ is sub-gamma on the right-tail with variance
factor at most $4(\gamma + 3\overline{\eta}(T))^2$ and scale factor not larger
than $(\gamma + 3\overline{\eta}(T))/\sqrt{\ell}$. Under the von Mises
condition (Definition 2.1), the sampling distribution is absolutely continuous
with respect to Lebesgue measure. For almost every sample, the maximum
defining $Z_j$ is attained at a single index $i \in \mathcal{S}_j$. Starting
again from the exponential representation and repeating the computation of
partial derivatives, we obtain the desired bounds.

By Proposition 4.1,

$$\begin{aligned}
\mathbb{E}\Bigl[ \max_{j \in \mathcal{L}_n} \bigl( Z_j - \mathbb{E}[Z_j \mid Y_{(k+1)}] \bigr) \Bigm| Y_{(k+1)} \Bigr]
&\le \left( \sqrt{8 \ln|\mathcal{L}_n|} + \frac{\ln|\mathcal{L}_n|}{\sqrt{\ell}} \right) (\gamma + 3\overline{\eta}(T)) \\
&\le 4\sqrt{\ln|\mathcal{L}_n|}\, (\gamma + 3\overline{\eta}(T)) \\
&\le 4\sqrt{\ln|\mathcal{L}_n|}\ \gamma \left( 1 + \frac{3r_n}{\sqrt{J}} \right) ,
\end{aligned}$$

where we have used $\ln|\mathcal{L}_n| \le \ln(\log_2(n)) \le \ln(n) \le \ell$
for $n \ge 2$. Combining the different bounds leads to the upper bound on
$\mathbb{E}[Z \mid T]$. □


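4.4. Proof of Theorem 3.8


Throughout this proof, let

$$\begin{aligned}
T_n &= \exp\bigl(Y_{(k_n+1)}\bigr) , \\
\xi_n &= c_1 \sqrt{\ln \log_2 n} + c_1' \quad \text{where } c_1, c_1' \text{ are defined in Section 3.1,} \\
z_\delta &= (1 + 3r_n/\sqrt{k_n}) \left( \xi_n + \sqrt{8\ln(2/\delta)} + \frac{\ln(2/\delta)}{\sqrt{\ell_n}} \right) , \\
\overline{z}_\delta &= (1 + 3r_n/\sqrt{\ell_n}) \left( \xi_n + \sqrt{8\ln(2/\delta)} + \frac{\ln(2/\delta)}{\sqrt{\ell_n}} \right) .
\end{aligned}$$

Let us define the events $E_1$ and $E_2$ as

$$E_1 = \Bigl\{ \forall i,\ c_2 \ln n \le i \le k_n,\ \sqrt{i}\, \bigl| \widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid T_n] \bigr| \le \gamma z_\delta \Bigr\} ,$$

$$E_2 = \Bigl\{ T_n \ge \frac{n}{k_n^{\delta}} \Bigr\} \quad \text{with } k_n^{\delta} = k_n + 2\ln(1/\delta) + \sqrt{2k_n \ln(1/\delta)} .$$

The fact that $\mathbb{P}(E_2) \ge 1 - \delta$ follows from the following
reformulation of Proposition 4.3 from [Boucheron and Thomas, 2012] (a proof is
given in Appendix D).

Proposition 4.6. For $\delta \in (0, 1)$, with probability larger than
$1 - \delta$,

$$\exp\bigl(Y_{(k+1)}\bigr) \ge \frac{n}{k^{\delta}} \quad \text{with } k^{\delta} = k + 2\ln(1/\delta) + \sqrt{2k\ln(1/\delta)} ,$$

where $Y_{(k+1)}$ is the $(k+1)$th largest order statistic of an exponential
sample of size $n$.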
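
Proposition 4.6 lends itself to a quick Monte-Carlo check. The Python sketch below is an illustration only (function names and the values of $n$, $k$, $\delta$ and the number of replications are hypothetical choices). It simulates $Y_{(k+1)}$ through Rényi's representation $Y_{(k+1)} = \sum_{i=k+1}^{n} E_i/i$ and estimates the probability that $\exp(Y_{(k+1)}) \ge n/k^{\delta}$.

```python
import math
import random


def inflated_index(k, delta):
    """k^delta = k + 2 ln(1/delta) + sqrt(2 k ln(1/delta))."""
    t = math.log(1.0 / delta)
    return k + 2.0 * t + math.sqrt(2.0 * k * t)


def coverage(n, k, delta, replications, seed=0):
    """Empirical frequency of the event exp(Y_(k+1)) >= n / k^delta."""
    rng = random.Random(seed)
    threshold = math.log(n / inflated_index(k, delta))
    hits = 0
    for _ in range(replications):
        # Renyi's representation of the (k+1)th largest exponential order statistic.
        y = sum(rng.expovariate(1.0) / i for i in range(k + 1, n + 1))
        if y >= threshold:
            hits += 1
    return hits / replications


freq = coverage(n=1000, k=50, delta=0.1, replications=300)
```

The empirical frequency should be at least $1 - \delta$ up to Monte-Carlo error.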
By Theorem 3.3, $\mathbb{P}(E_1 \mid E_2) \ge 1 - \delta$. Hence, the event
$E_1 \cap E_2$ has probability at least $(1 - \delta)^2 \ge 1 - 2\delta$.

Under $E_2$,

i) $\overline{\eta}(T_n) \le \gamma r_n/\sqrt{k_n}$.

ii) for all $\ell_n \le i \le k_n$,
$|\gamma - \mathbb{E}[\widehat{\gamma}(i) \mid T_n]| \le \overline{\eta}(T_n)$.

The first step of the proof consists in checking that under $E_1 \cap E_2$,
the selected index is not smaller than $k_n$. It suffices to check that for
all $i, k$ such that $\ell_n \le i \le k \le k_n$,

$$\sqrt{i}\, \bigl| \widehat{\gamma}(i) - \widehat{\gamma}(k) \bigr| \le r_n(\delta)\, \widehat{\gamma}(i) .$$


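For all $i \in \{\ell_n, \ldots, k_n\}$,

$$\begin{aligned}
\bigl| \widehat{\gamma}(i) - \gamma \bigr| &\le \bigl| \gamma - \mathbb{E}[\widehat{\gamma}(i) \mid T_n] \bigr| + \bigl| \widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid T_n] \bigr| \\
&\le \overline{\eta}(T_n) + \frac{\gamma z_\delta}{\sqrt{i}} \\
&\le \frac{\gamma r_n}{\sqrt{k_n}} + \frac{\gamma z_\delta}{\sqrt{i}}
\end{aligned}$$

so that

$$\frac{\widehat{\gamma}(i)}{\gamma} \ge 1 - \frac{r_n + z_\delta}{\sqrt{i}} .$$

Meanwhile, for all $i, k$,

$$\bigl| \widehat{\gamma}(i) - \widehat{\gamma}(k) \bigr| \le \underbrace{\bigl| \widehat{\gamma}(i) - \mathbb{E}[\widehat{\gamma}(i) \mid T_n] \bigr|}_{\text{(i)}} + \underbrace{\bigl| \mathbb{E}[\widehat{\gamma}(i) - \widehat{\gamma}(k) \mid T_n] \bigr|}_{\text{(ii)}} + \underbrace{\bigl| \widehat{\gamma}(k) - \mathbb{E}[\widehat{\gamma}(k) \mid T_n] \bigr|}_{\text{(iii)}} .$$

Under $E_1 \cap E_2$, for $\ell_n \le i \le k \le k_n$,

$$\text{(i)} + \text{(iii)} \le \gamma z_\delta \left( \frac{1}{\sqrt{i}} + \frac{1}{\sqrt{k}} \right) \le \frac{2\gamma z_\delta}{\sqrt{i}} .$$

Under $E_2$,

$$\begin{aligned}
\text{(ii)} &\le \bigl| \mathbb{E}[\widehat{\gamma}(i) - \gamma \mid T_n] \bigr| + \bigl| \mathbb{E}[\gamma - \widehat{\gamma}(k) \mid T_n] \bigr| \\
&\le 2\overline{\eta}(T_n) \\
&\le 2\gamma r_n/\sqrt{k_n} .
\end{aligned}$$

Plugging upper bounds on (i), (ii) and (iii), it comes that, under
$E_1 \cap E_2$, for all $k \le k_n$ and for all $i \in \{\ell_n, \ldots, k\}$,

$$\sqrt{i}\, \frac{\bigl| \widehat{\gamma}(i) - \widehat{\gamma}(k) \bigr|}{\gamma} \le 2z_\delta + 2r_n .$$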
In order to warrant that, under $E_1 \cap E_2$, for all $k \le k_n$ and for all
$i$ such that $c_2 \ln n \le i \le k$,
$\sqrt{i}\, |\widehat{\gamma}(i) - \widehat{\gamma}(k)| \le r_n(\delta)\, \widehat{\gamma}(i)$,
it is enough to have

^zs + r„) < r„((5) ^1 - • 

The last inequality holds because 

2 ( 2:5 + Tn) < TniS) 

by definition of rn{ 6 ). 

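Hence, with probability larger than $(1 - \delta)^2$, $E_1 \cap E_2$ is
realised, and under $E_1 \cap E_2$, $\widehat{k}_n \ge k_n$.

We now check that if $\widehat{k}_n \ge k_n$, the risk of
$\widehat{\gamma}(\widehat{k}_n)$ is not much larger than the risk of
$\widehat{\gamma}(k_n)$.

$$\bigl| \gamma - \widehat{\gamma}(\widehat{k}_n) \bigr| \le \bigl| \gamma - \widehat{\gamma}(k_n) \bigr| + \bigl| \widehat{\gamma}(\widehat{k}_n) - \widehat{\gamma}(k_n) \bigr| .$$

Therefore, under $E_1 \cap E_2$,

$$\bigl| \gamma - \widehat{\gamma}(\widehat{k}_n) \bigr| \le \bigl| \gamma - \widehat{\gamma}(k_n) \bigr| \left( 1 + \frac{r_n(\delta)}{\sqrt{k_n}} \right) + \frac{r_n(\delta)\, \gamma}{\sqrt{k_n}} . \qquad (4.7)$$

Now, consider the event $E_1 \cap E_2 \cap E_3$ with

$$E_3 = \left\{ \sqrt{k_n}\, \bigl| \widehat{\gamma}(k_n) - \mathbb{E}[\widehat{\gamma}(k_n) \mid T_n] \bigr| \le (\gamma + 3\overline{\eta}(T_n)) \left( \sqrt{8\ln(2/\delta)} + \frac{\ln(2/\delta)}{\sqrt{k_n}} \right) \right\} .$$

Since $\mathbb{P}(E_3 \mid E_2) \ge 1 - \delta$, thanks to Statement i) from
Theorem 3.3, the event $E_1 \cap E_2 \cap E_3$ has probability at least
$(1 - \delta)(1 - 2\delta) \ge 1 - 3\delta$.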

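Then, by definition of $k_n$, under $E_2$,

$$\bigl| \gamma - \mathbb{E}[\widehat{\gamma}(k_n) \mid T_n] \bigr| \le \overline{\eta}(T_n) \le \gamma r_n/\sqrt{k_n} .$$

Hence, under $E_2 \cap E_3$,

$$\bigl| \widehat{\gamma}(k_n) - \gamma \bigr| \le \bigl| \gamma - \mathbb{E}[\widehat{\gamma}(k_n) \mid T_n] \bigr| + \bigl| \widehat{\gamma}(k_n) - \mathbb{E}[\widehat{\gamma}(k_n) \mid T_n] \bigr|$$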


< 


7 

\/kn 




In (2/(5) 

^kn 


Therefore, plugging this bound into (4.7), with probability larger than 1 — 3i5, 
l{kn) - 7 

rn{5) 




27r„((5) 

< —-^{l + a{d,n)), 

V 




1 + 
where


a{S,n) = 


rn \/ln(2/<5) / 3r„(^) 

2v^ rn{5) V 


4.5. Proof of Corollary 3.12

If, for some C > 0 and p < 0, 

then, by the definition of kn, 


\/fcn + 1 


v{t) < cc, 


< c 


{kn + 1 )^ 


s ; > 


which entails that 


ITn 


< c 


+ 1 

Solving this inequality leads to 

and finally to 


(^y/kn + 1 + \/2 ln(l/5) 


2/(l+2|p|) 


„2|p|/(l+2|p|) _ 


2|p|V21n(l/<5) 

1 + 2|p| 


kn>- 

2 


1 l/(l+ 2 |pl) |p|/(i+ 2 |p|) _ 2 ( 2 |p|V 21 n(l/J) \ 

VcJ l + 2 |p| ) 


- 1 . 


Thus, for sufficiently large n, there exists a constant c depending on p, S such 
that 

(2^)'/^'+^'"'^IpI/(i+2|pI), 

Starting from Equation (3.9) of Theorem 3.8, with probability 1 — 3(5, 


7(^n) - 7 


< 167 < 


1 2 ln((2/(5) log 2 n) 


(1 + Q;((5,n)) , 


and, there exists a constant kc,5,p: depending on C, S and p, such that 
^/ ln((2/J)log2n) ^ ^_i/(i+2|p|) ^ ln((2/(5)log2 n) 

Hence, with probability larger than 1 — 43, 

'72 ln((2/(5) log 2 


l{kn) - 7 


< I^C,S,p 


(1 + a{S,n)) 
5. Simulations


Risk bounds like Theorem 3.8 and Corollary 3.12 are conservative. For all
practical purposes, they are just meant to be reassuring guidelines. In this
numerical section, we intend to shed some light on the following issues:

1. Is there a reasonable way to calibrate the threshold $r_n(\delta)$ used in
the definition of $\widehat{k}_n$? How does the method perform if we choose
$r_n(\delta)$ close to $\sqrt{2\ln\ln(n)}$?

2. How large is the ratio between the risk of
$\widehat{\gamma}(\widehat{k}_n)$ and the risk of $\widehat{\gamma}(k_n^*)$ for
moderate sample sizes?

The finite-sample performance of the data-driven index selection method
described and analysed in Section 3.2 has been assessed by Monte-Carlo
simulations. Computations have been carried out in R using packages ggplot2
[Wickham, 2009], knitr, foreach, iterators, xtable and dplyr [see Wickham,
2014, for a modern account of the R environment]. To get into the details,
we investigated the performance of index selection methods on samples of sizes
1000, 2000 and 10000 from the collection of distributions listed in Table 1. The
list comprises the following distributions:

i) Fréchet distributions $F_\gamma(x) = \exp(-x^{-1/\gamma})$ for $x > 0$ and
$\gamma \in \{0.2, 0.5, 1\}$.

ii) Student distributions $t_\nu$ with $\nu \in \{1, 2, 4, 10\}$ degrees of
freedom.

iii) The log-gamma distribution with density proportional to
$(\ln(x))^{2-1} x^{-3-1}$, which means $\gamma = 1/3$ and $\rho = 0$.

iv) The Lévy distribution with density
$\sqrt{1/(2\pi)}\, x^{-3/2} \exp(-1/(2x))$, $\gamma = 2$ and $\rho = -1$ (this
is the distribution of $1/X^2$ when $X \sim \mathcal{N}(0, 1)$).

v) The H distribution is defined by $\gamma = 1/2$ and von Mises function equal
to $\eta(s) = (2/s) \ln 1/s$. This distribution satisfies the second-order
regular variation condition with $\rho = -1$ but does not satisfy Condition (2.7).

vi) Two Pareto change point distributions with distribution functions

F{x) = 

and $\gamma \in \{1.5, 1.25\}$, $\gamma' = 1$, and thresholds $\tau$ adjusted
in such a way that they correspond to quantiles of order $1 - 1/15$ and
$1 - 1/25$, respectively.

Fréchet, Student, log-gamma distributions were used as benchmarks by [Drees
and Kaufmann, 1998], [Danielsson et al., 2001] and [Carpentier and Kim, 2015].

Table 1, which is complemented by Figure 3, describes the difficulty of tail
index estimation from samples of the different distributions. Monte-Carlo
estimates of the standardised root mean square error (rmse) of Hill estimators

$$\mathbb{E}\Bigl[ \bigl( \widehat{\gamma}(k)/\gamma - 1 \bigr)^2 \Bigr]^{1/2}$$

are represented as functions of the number of order statistics $k$ for samples
of size 10000 from the sampling distributions. All curves exhibit a common
pattern: for small values of $k$, the rmse is dominated by the variance term
and scales like $1/\sqrt{k}$. Above a threshold that depends on the sampling
distribution but that is not completely characterised by the second-order
regular variation index, the rmse grows at a rate that may reflect the
second-order regular variation property (if any) of the distribution. Not too
surprisingly, the three Fréchet distributions exhibit the same risk profile.
The three curves are almost indistinguishable. The Student distributions
illustrate the impact of the second-order parameter on the difficulty of the
index selection problem. For sample size $n = 10000$, the optimal index for
$t_{10}$ is smaller than 30, which is smaller than the usual recommendations.
For such moderate sample sizes, distribution $t_{10}$ seems as hard to handle
as the log-gamma distribution which usually fits in the Horror Hill Plot
gallery. The 1/2-stable Lévy distribution and the H distribution behave very
differently. Even though they both have second-order parameter $\rho$ equal to
$-1$, the H distribution seems almost as challenging as the $t_4$ distribution
while the Lévy distribution looks much easier than the Fréchet distributions.
The Pareto change point distributions exhibit an abrupt transition.


Table 1

Estimated oracle index $k_n^*$ and standardised RMSE
$\mathbb{E}[(\gamma - \widehat{\gamma}(k_n^*))^2]^{1/2}/\gamma$ for benchmark
distributions. Estimates were computed from 5000 replicated experiments on
samples of size 10000.

d.f.         $\gamma$   $\rho$   $k_n^*$   RMSE
F0.2         0.2        1.0      1132      3.7e-02
F0.5         0.5        1.0      1145      3.6e-02
F1           1.0        1.0      1155      3.6e-02
t1           1.0        2.0      1161      3.3e-02
t2           0.5        1.0      341       6.5e-02
t4           0.2        0.5      77        1.6e-01
t10          0.1        0.2      15        5.3e-01
H            0.5        1.0      130       1.1e-01
log-gamma    0.3        0.0      213       1.6e-01
Stable       2.0        1.0      3172      2.0e-02
Pcp          1.5        0.3      943       3.3e-02
Pcp (bis)    1.2        0.2      593       4.2e-02


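Index $\widehat{k}_n(r_n)$ was computed according to the following rule

$$\widehat{k}_n(r_n) = \min\left\{ k : 30 \le k \le n \text{ and } \exists i \in \{30, \ldots, k\},\ \bigl| \widehat{\gamma}(i) - \widehat{\gamma}(k) \bigr| > \frac{r_n\, \widehat{\gamma}(i)}{\sqrt{i}} \right\} - 1 \qquad (5.1)$$

with $r_n = \sqrt{c \ln\ln n}$ where $c = 2.1$ unless otherwise specified.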
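
Rule (5.1) can be sketched in a few lines. The Python code below is an illustration under stated assumptions and not the authors' implementation (the original computations were carried out in R): the function names are hypothetical, the fallback value returned when no index violates the constraint is an assumption since (5.1) does not cover that case, and the Fréchet sample is generated as $E^{-\gamma}$ with $E$ standard exponential.

```python
import math
import random


def hill_path(sample, k_max):
    """Hill estimates gamma_hat(k) for k = 1, ..., k_max, returned as a dict."""
    logs = sorted((math.log(x) for x in sample), reverse=True)
    path, running = {}, 0.0
    for k in range(1, k_max + 1):
        running += logs[k - 1]             # sum of the k largest log-observations
        path[k] = running / k - logs[k]    # subtract the log of the (k+1)th largest
    return path


def lepski_index(path, n, k_min=30, c=2.1):
    """Selected index following rule (5.1) with r_n = sqrt(c ln ln n)."""
    r_n = math.sqrt(c * math.log(math.log(n)))
    k_max = max(path)
    for k in range(k_min, k_max + 1):
        for i in range(k_min, k + 1):
            if abs(path[i] - path[k]) > r_n * path[i] / math.sqrt(i):
                return k - 1               # first violating k, minus one
    return k_max                           # assumption: no violation found


rng = random.Random(1)
n, gamma = 1000, 0.5
sample = [rng.expovariate(1.0) ** (-gamma) for _ in range(n)]  # Frechet sample
path = hill_path(sample, n - 1)
k_hat = lepski_index(path, n)
gamma_hat = path[k_hat]
```

The double loop has quadratic cost in the largest index considered, which is acceptable for the sample sizes used here; a cumulative bookkeeping of the admissible interval for each $i$ would reduce this cost if needed.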

The Frechet, Student, H and stable distributions all fit into the framework 
considered by [Drees and Kaufmann, 1998]. They provide a favorable ground for 
comparing the performance of the optimal index selection method described by 
Drees and Kaufmann [1998] which attempts to take advantage of the second- 
order regular variation property and the performance of the simple selection 
rule described in this paper. 
[Figure 3: rmse curves plotted against $k$. Legend (Distribution): F0.2, F0.5,
F1, t1, t2, t4, t10, H, Stable, Pcp, Pcp (bis), log-gamma.]

Fig 3. Monte-Carlo estimates of the standardised root mean square error (rmse)
of Hill estimators as a function of the number of order statistics $k$ for
samples of size 10000 from the sampling distributions.


Index $\widehat{k}_n^{\mathrm{dk}}$ was computed following the recommendations
from Theorem 1 and discussion in [Drees and Kaufmann, 1998]

C = m + 1)-'/'^' (5.2) 

\{ku[rn)Y ) 

where p should belong to a consistent family of estimators of p (under a second- 
order regular variation assumption), 7 should be a preliminary estimator of 7 
such as 7(v^), C, = .7, and r„ = Following the advice from [Drees and 

Kaufmann, 1998], we replaced $|\widehat{\rho}|$ by 1. Note that the method for
computing $\widehat{k}_n^{\mathrm{dk}}$ depends on a variety of tunable
parameters.

Comparisons between performances of $\widehat{\gamma}(\widehat{k}_n(r_n))$ and
$\widehat{\gamma}(\widehat{k}_n^{\mathrm{dk}})$ are reported in Tables 2 and 3.
For each distribution from Table 1, for sample sizes $n = 1000, 2000$, and
$10000$, 5000 experiments were replicated. As pointed out in [Drees and
Kaufmann, 1998], on the sampling distributions that satisfy a second-order
regular variation property, carefully tuned $\widehat{k}_n^{\mathrm{dk}}$ is
able to take advantage of it. Despite its computational and conceptual
simplicity and the fact that it is almost parameter free, the estimator
$\widehat{\gamma}(\widehat{k}_n(r_n))$ only suffers a moderate loss with
respect to the oracle. When $|\rho| = 1$, the observed ratios are of the same
order as $(2\ln\ln n)^{1/3} \approx 1.65$. Moreover, whereas
$\widehat{\gamma}(\widehat{k}_n^{\mathrm{dk}})$ behaves erratically when facing
Pareto change point distributions, $\widehat{\gamma}(\widehat{k}_n(r_n))$
behaves consistently.
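
The order of magnitude $(2\ln\ln n)^{1/3}$ quoted above is easily evaluated for the three sample sizes used in the simulations. The Python sketch below is an illustration only (the function name is hypothetical).

```python
import math


def loglog_factor(n):
    """(2 ln ln n)^(1/3), the benchmark ratio quoted for |rho| = 1."""
    return (2.0 * math.log(math.log(n))) ** (1.0 / 3.0)


factors = {n: loglog_factor(n) for n in (1000, 2000, 10000)}
```

The factor grows very slowly with $n$: it is roughly 1.57, 1.59 and 1.64 for $n = 1000$, $2000$ and $10000$, so the quoted value 1.65 corresponds to the largest sample size.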
Table 2

Ratios between median selected indices $\widehat{k}_n(r_n)$ (Lepski),
$\widehat{k}_n^{\mathrm{dk}}$ (Drees-Kaufmann) and estimated oracle index $k_n^*$.

                        $\widehat{k}_n^{\mathrm{dk}}/k_n^*$        $\widehat{k}_n(r_n)/k_n^*$
d.f.        $\gamma$    n = 1000   2000    10000      1000     2000     10000
F0.2        0.2         0.61       0.67    0.94       2.94     2.97     3.47
F0.5        0.5         1.12       1.18    1.45       2.90     2.87     2.91
F1          1           1.76       2.05    2.32       2.90     3.10     2.93
t1          1           1.33       1.55    1.98       2.03     2.16     2.16
t2          0.5         1.00       0.99    0.91       3.05     3.06     2.96
t4          0.25        1.27       1.28    1.18       5.62     5.50     5.30
t10         0.1         2.00       1.54    2.28       13.87    10.92    14.12
H           0.5         0.41       0.35    0.30       5.14     4.97     4.96
Stable      2           0.97       0.95    1.04       1.43     1.41     1.55
Pcp         1.5         1.85       0.45    0.15       1.32     1.21     1.10
Pcp (bis)   1.25        3.29       3.03    2.45       1.83     1.50     1.22
log-gamma   0.33        5.13       7.71    12.41      10.50    12.99    12.40

Table 3

Ratios between median rmse and median optimal rmse.

                        RMSE($\widehat{\gamma}(\widehat{k}_n^{\mathrm{dk}})$)/RMSE($\widehat{\gamma}(k_n^*)$)    RMSE($\widehat{\gamma}(\widehat{k}_n(r_n))$)/RMSE($\widehat{\gamma}(k_n^*)$)
d.f.        $\gamma$    n = 1000   2000    10000      1000    2000    10000
F0.2        0.2         1.12       1.12    1.02       2.06    2.26    2.69
F0.5        0.5         1.03       1.03    1.14       2.12    2.23    2.70
F1          1           1.22       1.31    1.59       2.07    2.23    2.64
t1          1           1.26       1.34    1.74       2.31    2.39    3.11
t2          0.5         1.11       1.08    1.05       2.06    2.09    2.20
t4          0.25        1.10       1.07    1.04       1.85    1.81    1.84
t10         0.1         1.10       1.09    1.08       1.76    1.72    1.64
H           0.5         1.28       1.37    1.48       2.15    2.18    2.12
Stable      2           1.01       0.99    0.98       1.99    2.52    3.60
Pcp         1.5         4.25       1.66    2.52       2.50    2.68    3.63
Pcp (bis)   1.25        3.38       4.47    7.45       2.43    2.56    3.10
log-gamma   0.33        1.23       1.28    1.39       1.45    1.43    1.37


Figure 4 concisely describes the behaviour of the two index selection methods
on samples from the Pareto change point distribution with parameters
$\gamma = 1.5$, $\gamma' = 1$ and threshold $\tau$ corresponding to the
$1 - 1/15$ quantile. The plain line represents the standardised rmse of Hill
estimators as a function of selected index. This figure contains the
superposition of two density plots corresponding to
$\widehat{k}_n^{\mathrm{dk}}$ and $\widehat{k}(r_n)$. The density plots were
generated from 5000 points with coordinates
$(\widehat{k}(r_n), |\widehat{\gamma}(\widehat{k}(r_n))/\gamma - 1|)$ and 5000
points with coordinates
$(\widehat{k}_n^{\mathrm{dk}}, |\widehat{\gamma}(\widehat{k}_n^{\mathrm{dk}})/\gamma - 1|)$.
The contoured and well-concentrated density plot corresponds to the
performance of $\widehat{\gamma}(\widehat{k}_n)$. The diffuse tiled density
plot corresponds to the performance of $\widehat{k}_n^{\mathrm{dk}}$. Facing
Pareto change point samples, the two selection methods behave differently.
Lepski's rule correctly detects an abrupt change at some point and selects an
index slightly above that point. As the conditional bias varies sharply around
the change point, this slight overestimation of the correct index still
results in a significant loss as far as rmse is concerned. The Drees-Kaufmann
rule, fed with an a priori estimate of the second-order parameter, picks out a
much smaller index, and suffers a larger excess risk.



Fig 4. Risk plot for samples of size 10000 from the Pareto change point
distribution with parameters $\gamma = 1.5$, $\gamma' = 1$ and threshold $\tau$
corresponding to the $1 - 1/15$ quantile. The concentrated density plot
corresponds to points
$(\widehat{k}(r_n), |\widehat{\gamma}(\widehat{k}(r_n))/\gamma - 1|)$.


Acknowledgement The authors are thankful to the editor and the referees

Appendix A: Calibration of the preliminary selection rule

Darling and Erdős [1956] establish (among other things) that, letting $Z_n$
denote $\sup_{k \le n} \frac{1}{\sqrt{k}} \sum_{i=1}^{k} (E_i - 1)$ where
$E_i$, $1 \le i \le n$, are independent exponentially distributed random
variables, the sequence

\[
\sqrt{2 \ln\ln n}\left( Z_n - \sqrt{2 \ln\ln n}
  - \frac{\ln\ln\ln(n)}{2\sqrt{2 \ln\ln n}} \right)
\]

converges in distribution towards a translated Gumbel distribution. In other
words, asymptotically, $Z_n$ behaves almost like the maximum of $\ln n$
independent standard Gaussian random variables.
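
A small Monte-Carlo sketch can make the statistic concrete. It assumes that $Z_n$ is the maximum over $k \le n$ of the normalised centred partial sums $\frac{1}{\sqrt{k}}\sum_{i \le k}(E_i - 1)$ of standard exponential variables; the function names are illustrative. Convergence in the Darling-Erdős theorem is very slow (iterated logarithms), so the comparison printed below is only indicative.

```python
import math
import random


def darling_erdos_statistic(exponentials):
    """Return max over k of (1 / sqrt(k)) * sum_{i <= k} (E_i - 1)."""
    best = -math.inf
    partial = 0.0
    for k, e in enumerate(exponentials, start=1):
        partial += e - 1.0  # centred partial sum
        best = max(best, partial / math.sqrt(k))
    return best


def simulate(n, replications, seed=0):
    """Simulate independent copies of Z_n from standard exponential samples."""
    rng = random.Random(seed)
    return [
        darling_erdos_statistic(rng.expovariate(1.0) for _ in range(n))
        for _ in range(replications)
    ]


n = 10000
values = simulate(n, 100)
print("empirical mean of Z_n:", sum(values) / len(values))
print("sqrt(2 ln ln n):      ", math.sqrt(2.0 * math.log(math.log(n))))
```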


Appendix B: Proof of Corollary 2.16 

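Let $Z = g(E_1, \ldots, E_n) = (U \circ \exp)\left(\sum_{i=k}^{n} E_i / i\right)$.
Then,

\[
|\partial_i g| \le \frac{1}{i} \sup_x \frac{1}{h(x)} \quad \text{for } i \ge k,
\]

and

\[
\|\nabla g\|^2 = \sum_{i=k}^{n} \frac{1}{i^2} \frac{1}{(h \circ g)^2} .
\]

Let $c < 1$, then for all $\lambda$, $0 \le \lambda \le c\,(k \inf_x h(x))$,

\[
\ln \mathbb{E} e^{\lambda (Z - \mathbb{E} Z)}
  \le \frac{4}{k}\left(1 + \frac{1}{k}\right)
      \mathbb{E}\left[\frac{1}{h(Z)^2}\right] \frac{\lambda^2}{2(1-c)} .
\]

Now, start from the first statement in Theorem 2.15,

\begin{align*}
\mathrm{Ent}\left[e^{\lambda (Z - \mathbb{E} Z)}\right]
 &\le \frac{2\lambda^2}{1-c}\,
      \mathbb{E}\left[e^{\lambda (Z - \mathbb{E} Z)} \|\nabla g\|^2\right] \\
 &\le \frac{4\lambda^2}{2(1-c)} \frac{1}{k}\left(1 + \frac{1}{k}\right)
      \mathbb{E}\left[\frac{e^{\lambda (Z - \mathbb{E} Z)}}{h(Z)^2}\right] \\
 &\le \frac{4\lambda^2}{2(1-c)} \frac{1}{k}\left(1 + \frac{1}{k}\right)
      \mathbb{E}\left[e^{\lambda (Z - \mathbb{E} Z)}\right]
      \mathbb{E}\left[\frac{1}{h(Z)^2}\right] ,
\end{align*}

where the last inequality follows from Chebyshev's negative association
inequality. Hence,

\[
\frac{\mathrm{d}}{\mathrm{d}\lambda}
  \frac{\ln \mathbb{E} e^{\lambda (Z - \mathbb{E} Z)}}{\lambda}
 = \frac{\mathrm{Ent}\left[e^{\lambda (Z - \mathbb{E} Z)}\right]}
        {\lambda^2\, \mathbb{E}\left[e^{\lambda (Z - \mathbb{E} Z)}\right]}
 \le \frac{4}{2(1-c)} \frac{1}{k}\left(1 + \frac{1}{k}\right)
     \mathbb{E}\left[\frac{1}{h(Z)^2}\right] .
\]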


This differential inequality is readily solved and leads to the corollary. 
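
For readers who want the last step spelled out, here is one standard way to solve such an inequality (the Herbst argument). It is a sketch under the assumption that the bound has the form $\frac{\mathrm{d}}{\mathrm{d}\lambda}\left(G(\lambda)/\lambda\right) \le C$ on the admissible range of $\lambda$, where $G$ and $C$ are shorthand introduced here for the logarithmic moment generating function and for the right-hand side, which does not depend on $\lambda$.

```latex
\[
G(\lambda) = \ln \mathbb{E} e^{\lambda (Z - \mathbb{E} Z)}, \qquad
\lim_{\lambda \downarrow 0} \frac{G(\lambda)}{\lambda}
  = \mathbb{E}[Z - \mathbb{E} Z] = 0 ,
\]
\[
\frac{G(\lambda)}{\lambda}
  = \int_0^{\lambda} \frac{\mathrm{d}}{\mathrm{d}s}\frac{G(s)}{s}\,\mathrm{d}s
  \le C \lambda
\quad\Longrightarrow\quad
G(\lambda) \le C \lambda^2 .
\]
```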
Appendix C: Proof of Abelian Proposition 3.2


The proof proceeds by classical arguments. In the sequel, we use the almost
sure representation argument. Without loss of generality, we assume that all
the random variables live on the same probability space, and that, for any
intermediate sequence $(k_n)$, $\sqrt{k_n}\,(Y_{(k_n+1)} - \ln(n/k_n))$
converges almost surely towards a standard Gaussian random variable.
Complemented with dominated convergence arguments, the next lemma will be the
key element of the proof.


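Lemma C.1. Let $\eta \in RV_\rho$, $\rho < 0$, and $Y_{(k_n+1)}$ be the
$(k_n+1)$th largest order statistic of a standard exponential sample, then,
for any intermediate sequence $(k_n)$ and $u > 0$,

\[
\lim_{n \to \infty} \frac{\eta\left(e^{u + Y_{(k_n+1)}}\right)}{\eta(n/k_n)}
  = e^{\rho u} \quad \text{a.s.}
\]

Proof. Note that

\[
\frac{\eta\left(e^{u + Y_{(k_n+1)}}\right)}{\eta(n/k_n)}
 = \frac{\eta\left(e^{u + Y_{(k_n+1)} - \ln(n/k_n)}\, n/k_n\right)}{\eta(n/k_n)} .
\]

Then, the result follows since $Y_{(k_n+1)} - \log(n/k_n) \to 0$ almost
surely and the convergence $\eta(tx)/\eta(t) \to x^\rho$ is locally uniform
on $(0, \infty)$. □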

In order to secure dominated convergence arguments, we will use Drees's
improvement of Potter's inequality [see de Haan and Ferreira, 2006, page 369].
For every $\epsilon, \delta > 0$, there exists $t_0 = t_0(\epsilon, \delta)$
such that, for $t, tx \ge t_0$,

\[
|\eta(tx)/\eta(t) - x^\rho| \le \epsilon\, x^\rho \max(x^\delta, x^{-\delta}) .
\tag{C.2}
\]

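To prove Proposition 3.2, we start from Representation (2.4):

\[
\hat\gamma(k_n) = \frac{1}{k_n} \sum_{i=1}^{k_n} \int_0^{E_i}
  \left(\gamma + \eta\left(e^{u + Y_{(k_n+1)}}\right)\right) \mathrm{d}u .
\]

By the Pythagorean relation,

\[
\mathrm{Var}(\hat\gamma(k_n))
 = \mathrm{Var}\left(\mathbb{E}[\hat\gamma(k_n) \mid Y_{(k_n+1)}]\right)
 + \mathbb{E}\left[\mathrm{Var}\left(\hat\gamma(k_n) \mid Y_{(k_n+1)}\right)\right] ,
\]

so that

\[
\frac{k_n \mathrm{Var}(\hat\gamma(k_n)) - \gamma^2}{\eta(n/k_n)}
 = \frac{k_n \mathrm{Var}\left(\mathbb{E}[\hat\gamma(k_n) \mid Y_{(k_n+1)}]\right)}{\eta(n/k_n)}
 + \frac{k_n \mathbb{E}\left[\mathrm{Var}\left(\hat\gamma(k_n) \mid Y_{(k_n+1)}\right)\right] - \gamma^2}{\eta(n/k_n)} .
\]

The second summand can be further decomposed using (2.4). With $E$ denoting a
standard exponential random variable independent of $Y_{(k_n+1)}$,

\begin{align*}
\frac{k_n \mathrm{Var}(\hat\gamma(k_n)) - \gamma^2}{\eta(n/k_n)}
 &= \frac{k_n \mathrm{Var}\left(\mathbb{E}[\hat\gamma(k_n) \mid Y_{(k_n+1)}]\right)}{\eta(n/k_n)}
    && \text{(i)} \\
 &\quad + \eta(n/k_n)\, \mathbb{E}\left[\mathrm{Var}\left(
      \int_0^{E} \frac{\eta\left(e^{u + Y_{(k_n+1)}}\right)}{\eta(n/k_n)}\,\mathrm{d}u
      \,\Big|\, Y_{(k_n+1)}\right)\right]
    && \text{(ii)} \\
 &\quad + 2\gamma\, \mathbb{E}\left[\mathrm{Cov}\left(E,
      \int_0^{E} \frac{\eta\left(e^{u + Y_{(k_n+1)}}\right)}{\eta(n/k_n)}\,\mathrm{d}u
      \,\Big|\, Y_{(k_n+1)}\right)\right] .
    && \text{(iii)}
\end{align*}

We check that (i) and (ii) tend to 0 and then that (iii) converges towards a
finite limit.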

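Fix $\epsilon, \delta > 0$ and define $M = \sup\{\eta(t), t \le t_0\}$.

Let $A_n$ denote the event $\{Y_{(k_n+1)} \ge \ln t_0(\epsilon, \delta)\}$.
For $n$ such that $\ln(n/k_n) \ge 2 \ln t_0$, as $Y_{(k_n+1)}$ is sub-gamma
with variance factor $1/k_n$,

\[
\mathbb{P}\{A_n^c\} \le \exp\left(-k_n (\ln(n/k_n))^2 / 8\right) .
\]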

We first check that (ii) tends to 0. Let $n$ be such that $n/k_n \ge t_0$ and
let $W_n$ denote the random variable $Y_{(k_n+1)} - \ln(n/k_n)$. Note that,
for $0 \le \lambda \le k_n/2$,

\[
\mathbb{E} e^{\lambda |W_n|} \le 2 e^{\lambda^2 / k_n} .
\]


Using Jensen’s inequality and Fubini’s Theorem, 

?7(e"+’'A'=,*+i)) 


E 


Var 


/o v{n/kn) 


-dw I 


< E 


^ ( 77(e“~*"^(''"+^U A ^ 


/o 


pOO 

nv 



1 E 


'0 J 

0 

\ 

l>00 




1 E 


'0 J 

0 

\ 


/ 77(e““'"^<''"+i)) 

V rjinlkn) 
'77(e"+'^"n/fc„) 


dudu 


ditdi; 


r]{n/kn) 

We now apply Potter's inequality (C.2) on the event $A_n$ with
$t = n/k_n \ge t_0$ and $tx = e^{u + Y_{(k_n+1)}} \ge t_0$, $u > 0$:


E 


Var 


dw I y(fc„+i) 


< 


< 


0 vin/k„) 

/‘O® /*^ r / \ 2 

J e-^v j E (l + + 1 a= 


m2 


■qiri/knY 


dwdt! 


E (l + e2e25(«+|v^„|)^ 


, , 2M2 ^ 


The first summand has a finite limit thanks to Lemma C.1. The second
summand converges to 0 as $\mathbb{E} \mathbb{1}_{A_n^c}$ tends to 0
exponentially fast while $1/\eta(n/k_n)^2$ tends to infinity algebraically
fast.

Bounds on (i) are easily obtained, using Jensen's inequality and the Poincaré
inequality.

fc„ Var (E[ 7 (/c„) I Y(fe^+i)]) _ Var (/g°° ry (e“+’^(''"+i))e““dM) 


r]{n/kn) 


< 477(n/fc„)E 


< 4?7(n/fc„)E 


r]{n/kn) 

( /■°° 77 ^ 


' rj (e“+^<''"+i)) 
ri{nlkn) 


e-“du 


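Using the same line of arguments as for the limit of (ii), we establish that
(i) converges to 0.

We now check that (iii) converges towards a finite limit. Note that

\[
\mathrm{Cov}\left(E,
  \int_0^{E} \frac{\eta\left(e^{u + Y_{(k_n+1)}}\right)}{\eta(n/k_n)}\,\mathrm{d}u
  \,\Big|\, Y_{(k_n+1)}\right)
 = \mathbb{E}\left[(E - 1)
  \int_0^{E} \frac{\eta\left(e^{u + Y_{(k_n+1)}}\right)}{\eta(n/k_n)}\,\mathrm{d}u
  \,\Big|\, Y_{(k_n+1)}\right] .
\]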


By Lemma C.l, for almost every u > 0, 

r]{e^~^'^"n/kn) 


and 


1^-11 


{E-iy 


■q[e^+'^^n/kn) 


ri{n/kn) 


Vin/kn) 


du 


{E - l)e'’“, 


< 


(l + ee‘^(“+l^"l)) du + tA:^E\E - 1| 


M 


\vin/kn)\ 
The first term is finite as the integral of a continuous function on a compact.
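Thus,

\[
(E - 1) \int_0^{E} \frac{\eta\left(e^{u + W_n}\, n/k_n\right)}{\eta(n/k_n)}\,\mathrm{d}u
 \xrightarrow[n \to \infty]{} (E - 1) \int_0^{E} e^{\rho u}\,\mathrm{d}u
 = (E - 1)\, \frac{e^{\rho E} - 1}{\rho} .
\]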


The expected value of the last random variable is $1/(1 - \rho)^2$.
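
The value of this expectation can be checked numerically. The sketch below assumes that the limiting random variable is $(E - 1)(e^{\rho E} - 1)/\rho$ with $E$ standard exponential and $\rho < 0$, and approximates its mean by Simpson's rule on a truncated range; the function name and the truncation point are illustrative choices.

```python
import math


def limit_variable_mean(rho, upper=80.0, steps=200_000):
    """Simpson approximation of E[(E - 1)(exp(rho * E) - 1) / rho], E ~ Exp(1).

    The integral over [0, infinity) is truncated at `upper`; the neglected
    tail is of order exp(-upper). `steps` must be even.
    """
    if rho >= 0:
        raise ValueError("this sketch assumes rho < 0")

    def integrand(x):
        return (x - 1.0) * (math.exp(rho * x) - 1.0) / rho * math.exp(-x)

    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0


for rho in (-0.5, -1.0, -2.0):
    print(rho, limit_variable_mean(rho), 1.0 / (1.0 - rho) ** 2)
```

The two printed columns should agree up to quadrature error, which reflects the identity $\mathbb{E}[E e^{\rho E}] - \mathbb{E}[e^{\rho E}] = \rho/(1-\rho)^2$.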
We check that, for sufficiently large n, 


\E-1\ 

< E 

< E 


|p(e"+^"n/fc„)| 
/o \vin/kn)\ 


drt 


\E-1\J^ e^(“+'^") (l + ee^("+l’^"l)) +1aoJE-1\ 


M 


\vin/kn)\ 


dit 


:,PW„ 


2 , MW/ 

5{l-5Y 


p/ 2 e 

< de'"" + 177:;-TTTie 


M 


M 

\'n{‘n/kn)\ 

j-El^c . 


El^c 


5(1-(5)2 |? 7 (n/fc„)| 

We may now conclude by dominated convergence that

\[
\lim_{n \to \infty} \text{(iii)} = \frac{2\gamma}{(1 - \rho)^2} .
\]


Appendix D: Proof of Proposition 4.6 

The proof of Proposition 4.3 from [Boucheron and Thomas, 2012] yields that,
with probability larger than $1 - \delta$, for $0 \le z$,

P |exp > 1 — exp 2fcsinh (z/2)^^ . 

We may choose $z = 2 \operatorname{arsinh}\left(\sqrt{\ln(1/\delta)/(2k)}\right)$
and notice that $\operatorname{arsinh}(x) = \ln\left(x + \sqrt{1 + x^2}\right)$.
This yields

< 1 + and exp (-2k sinh (z/2)^) = 5 . 
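
The calibration of $z$ can be verified numerically. The sketch below only checks the two identities quoted in the text, namely that $\exp(-2k \sinh(z/2)^2) = \delta$ for the chosen $z$ and that $\operatorname{arsinh}(x) = \ln(x + \sqrt{1+x^2})$; the function names and the values of $\delta$ and $k$ are arbitrary illustrations.

```python
import math


def calibrate_z(delta, k):
    """z = 2 * arsinh(sqrt(ln(1/delta) / (2k)))."""
    return 2.0 * math.asinh(math.sqrt(math.log(1.0 / delta) / (2.0 * k)))


def tail_bound(z, k):
    """exp(-2k * sinh(z/2)^2), the probability bound quoted in the text."""
    return math.exp(-2.0 * k * math.sinh(z / 2.0) ** 2)


delta, k = 0.05, 100
z = calibrate_z(delta, k)
x = math.sqrt(math.log(1.0 / delta) / (2.0 * k))

print("z =", z)
print("bound at z:", tail_bound(z, k), "target delta:", delta)
# Closed form of arsinh used in the text.
print(math.asinh(x), math.log(x + math.sqrt(1.0 + x * x)))
```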


Appendix E: Revisiting the lower bound on adaptive estimation 
error 

Lower bounds on tail index estimation error [Carpentier and Kim, 2015, Drees,
1998a, 2001, Novak, 2014] are usually constructed by defining sequences of
local models around a pure Pareto distribution with shape parameter
$\gamma_0$. When deriving lower bounds for the estimation error under
constraints such as regular variation of $\eta$, the elements of the local
model for sample size $n$ may be defined by


exp^‘ 


h{c„/s) - /i( 0 ) 


ds 


where h is square integrable over [0,1], —)■ 0, nd^/cn 1 [Drees, 2001]. The 

sequences $d_n$ and $c_n$ are chosen in such a way that $d_n |h(c_n/s) - h(0)| = |\eta(s)|$
satisfies the required constraint. If the local alternatives are Pareto change 
point distributions as in [Novak, 2014] and [Carpentier and Kim, 2015], 
h{x) = Cn = . Drees [2001] explores a richer collection of local 

alternatives in order to fit into the theory of weak convergence of local 
experiments. 

In order to explore adaptivity as in [Carpentier and Kim, 2015], it is necessary
to handle simultaneously a collection of sequences $(d_n, c_n)_n$ corresponding
to different rates of decay of the von Mises function. The difficulty of
estimation is connected with the difficulty of distinguishing alternatives with
different tail indices, that is, with the hardness of a multiple hypotheses
testing problem. In order to lower bound the testing error, Carpentier and Kim
chose to use Fano's Lemma [see Cover and Thomas, 1991]. Using Fano's Lemma
requires bounding the Kullback-Leibler divergence between the different local
alternatives, which is not as easy as bounding the divergence between a Pareto
change point distribution and a pure Pareto distribution.

The next lemma is from [Birgé, 2005]. It can be used in the derivation of risk
lower bounds instead of the classical Fano Lemma. Just as Fano's Lemma, it
states a lower bound on the error in multiple hypothesis testing. However, as it
only requires computing the Kullback-Leibler divergence to the localisation
center, in the present setting, it significantly alleviates computations and
makes the proof more concise and more transparent.


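Lemma E.1. (Birgé-Fano) Let $P_0, \ldots, P_M$ be a collection of probability
distributions on some space, and let $A_0, \ldots, A_M$ be a collection of
pairwise disjoint events, then the following holds

\[
\min_{i \in \{0, \ldots, M\}} P_i\{A_i\}
 \le \max\left( \frac{2e}{1 + 2e},\;
   \frac{\frac{1}{M} \sum_{i=1}^{M} \mathcal{K}(P_i, P_0)}{\ln(M + 1)} \right) .
\]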
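
To see how the lemma is used, the following sketch evaluates the bound for a given list of divergences to the localisation center. It assumes the bound has the form $\max\left(2e/(1+2e),\ \bar{\mathcal{K}}/\ln(M+1)\right)$, where $\bar{\mathcal{K}}$ is the average of the $\mathcal{K}(P_i, P_0)$; the function name and the sample values are illustrative.

```python
import math

# Universal constant 2e / (1 + 2e) appearing in the lemma.
KAPPA = 2.0 * math.e / (1.0 + 2.0 * math.e)


def birge_fano_bound(kl_to_center):
    """Upper bound on min_i P_i(A_i) given K(P_i, P_0) for i = 1..M.

    Assumed form: max(2e / (1 + 2e), mean KL / ln(M + 1)).
    """
    m = len(kl_to_center)
    if m < 1:
        raise ValueError("at least one alternative is required")
    if any(kl < 0 for kl in kl_to_center):
        raise ValueError("Kullback-Leibler divergences are non-negative")
    mean_kl = sum(kl_to_center) / m
    return max(KAPPA, mean_kl / math.log(m + 1))


# When every divergence is at most v * ln(M) with v < 2e / (1 + 2e),
# the second term stays below KAPPA and the bound equals KAPPA.
m, v = 13, 0.5
print(birge_fano_bound([v * math.log(m)] * m), KAPPA)
```

In that regime some alternative satisfies $P_i\{A_i^c\} \ge 1 - 2e/(1+2e) = 1/(1+2e)$, which is the form in which the lemma is applied in the proof of Theorem E.2.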


In order to take advantage of Lemma E.l, we use the Bayesian game designed 
in [Carpentier and Kim, 2015]. 

Theorem E.2. Let $\gamma > 0$, $\rho < -1$, and $0 < v < e/(1 + 2e)$. Then, for
any tail index estimator $\hat\gamma$ and any sample size $n$ such that
$M = \lfloor \ln n \rfloor > e^{e/v}$, there exists a collection
$(P_i)_{i \le M}$ of probability distributions such that

i) $P_i \in \mathrm{MDA}(\gamma_i)$ with $\gamma_i > \gamma$,

ii) $P_i$ meets the von Mises condition with von Mises function $\eta_i$
satisfying

\[
|\eta_i(t)| \le \gamma\, t^{\rho_i}
\]

where $\rho_i = \rho + i/M < 0$,
iii)


maxP®" 

i<M ® 


|7-7^I > 


z; In In n 


|P.|/(1+2|P.|)' 


> 


1 


1 + 26 


and 


max Ep®n 
i<M P 


lluJiil > <^P 

7, J - 4(1 + 2 e) 




with $C_\rho = 1 - \exp\left(-\frac{1}{2(1 + 2|\rho|)^2}\right)$.


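Proof of Theorem E.2. Choose $v$ so that $0 < v < 2e/(1 + 2e)$. The number of
alternative hypotheses $M$ is chosen in such a way that
$M/2 \le \ln(n/(v \ln M)) \le M$. If $\lfloor \ln n \rfloor > e^{e/v}$,
$M = \lfloor \ln n \rfloor$ will do.

The center of localisation $P_0$ is the pure Pareto distribution with shape
parameter $\gamma > 0$ ($P_0\{(\tau, \infty)\} = \tau^{-1/\gamma}$). The local
alternatives $P_1, \ldots, P_M$ are Pareto change point distributions. Each
$P_i$ is defined by a breakpoint $\tau_i > 1$ and an ultimate Pareto index
$\gamma_i$. If $F_i$ denotes the distribution function of $P_i$,

\[
\bar{F}_i(x) = x^{-1/\gamma}\, \mathbb{1}\{1 \le x \le \tau_i\}
 + \tau_i^{-1/\gamma} (x/\tau_i)^{-1/\gamma_i}\, \mathbb{1}\{x > \tau_i\} .
\]

Karamata's representation of $(1/\bar{F}_i)^{\leftarrow}$ is

\[
U_i(t) = t^{\gamma_i} \exp\left(\int_1^t \frac{\eta_i(s)}{s}\,\mathrm{d}s\right)
\]

with $\eta_i(s) = (\gamma - \gamma_i)\, \mathbb{1}\{s \le \tau_i^{1/\gamma}\}$.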

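The Kullback-Leibler divergence between $P_i$ and $P_0$ is readily calculated,

\[
\mathcal{K}(P_i, P_0)
 = P_i\{(\tau_i, \infty)\}\left(\frac{\gamma_i}{\gamma} - 1 - \ln\frac{\gamma_i}{\gamma}\right)
 = \tau_i^{-1/\gamma}\left(\frac{\gamma_i}{\gamma} - 1 - \ln\frac{\gamma_i}{\gamma}\right) .
\]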


If $\gamma_i > \gamma$, the next upper bound holds.

\[
\mathcal{K}(P_i, P_0) \le \tau_i^{-1/\gamma}\left(\frac{\gamma_i}{\gamma} - 1\right)^2 .
\]

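The breakpoints and tail indices are chosen in such a way that all upper
bounds are equal (namely $n \tau_i^{-1/\gamma} (\gamma_i/\gamma - 1)^2$ does
not depend on $i$),

\[
\tau_i = \left(\frac{n}{v \ln M}\right)^{\gamma/(1 + 2|\rho_i|)} , \qquad
\gamma_i = \gamma + \gamma \left(\frac{n}{v \ln M}\right)^{\rho_i/(1 + 2|\rho_i|)} ,
\]

so that $\mathcal{K}(P_i^{\otimes n}, P_0^{\otimes n}) = n \mathcal{K}(P_i, P_0) \le v \ln M$,
for all $1 \le i \le M$.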
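
The calibration can be checked numerically. The sketch below assumes the choices $\tau_i = (n/(v \ln M))^{\gamma/(1+2|\rho_i|)}$, $\gamma_i = \gamma + \gamma\,(n/(v \ln M))^{\rho_i/(1+2|\rho_i|)}$ and $\rho_i = \rho + i/M$ with $M = \lfloor \ln n \rfloor$. It ignores the requirement $M > e^{e/v}$ of Theorem E.2, which would demand astronomically large $n$, and the function name is hypothetical.

```python
import math


def local_alternatives(n, gamma, rho, v):
    """Return M and, for i = 1..M, (rho_i, tau_i, gamma_i, n * KL upper bound).

    Assumed calibration (see the lead-in): rho_i = rho + i / M,
    tau_i = x ** (gamma / (1 + 2|rho_i|)),
    gamma_i = gamma * (1 + x ** (rho_i / (1 + 2|rho_i|))),
    with x = n / (v * ln M).
    """
    m = math.floor(math.log(n))
    x = n / (v * math.log(m))
    alternatives = []
    for i in range(1, m + 1):
        rho_i = rho + i / m
        denom = 1.0 + 2.0 * abs(rho_i)
        tau_i = x ** (gamma / denom)
        gamma_i = gamma + gamma * x ** (rho_i / denom)
        # n * tau_i^(-1/gamma) * (gamma_i/gamma - 1)^2, the common upper bound
        kl_bound = n * tau_i ** (-1.0 / gamma) * (gamma_i / gamma - 1.0) ** 2
        alternatives.append((rho_i, tau_i, gamma_i, kl_bound))
    return m, alternatives


m, alts = local_alternatives(10 ** 6, 1.5, -1.5, 0.5)
print("M =", m, " v ln M =", 0.5 * math.log(m))
for rho_i, tau_i, gamma_i, kl_bound in alts:
    print(round(rho_i, 3), round(tau_i, 1), round(gamma_i, 5), round(kl_bound, 6))
```

Every row shows the same value of the bound, $v \ln M$, because the exponents of $n/(v \ln M)$ add up to $-1$ whenever $\rho_i < 0$.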

Note that, for all $t \ge 1$,

\[
|\eta_i(t)| = |\gamma - \gamma_i|\, \mathbb{1}\{t \le \tau_i^{1/\gamma}\} \le \gamma\, t^{\rho_i} ,
\]

the upper bound being achieved at $t = \tau_i^{1/\gamma}$.

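Now, let $\hat\gamma$ be any tail index estimator. Define region $A_i$ as the
set of samples such that $\gamma_i$ minimises $|\hat\gamma - \gamma_j|$, for
$1 \le j \le M$. Then, if the event $A_i$ is not realised,

\[
|\hat\gamma - \gamma_i| \ge \frac{1}{2} \min_{j \ne i} |\gamma_j - \gamma_i| .
\]

By Birgé's Lemma,

\[
\max_{i \le M} P_i^{\otimes n}\left\{ |\hat\gamma - \gamma_i|
  \ge \frac{1}{2} \min_{j \ne i} |\gamma_j - \gamma_i| \right\}
 \ge \frac{1}{1 + 2e} .
\]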

In order to make the whole construction useful, it remains to choose the
“second-order parameters” $\rho_i$'s (the true second-order parameter of each
$P_i$ is infinite!). We will need an upper bound on $\gamma_i/\gamma$ (but we
already have $\gamma_i/\gamma \le 2$), as well as a lower bound on
$|\gamma_j - \gamma_i|$ for $j \ne i$ that scales like

(n/ln 

Following Carpentier and Kim [2015], we finally choose $\rho_i$ as
$\rho_i = \rho + i/M$ for $1 \le i \le M$. Then, for $j < i$, using that
$M/2 \le \ln(n/(v \ln M)) \le M$ and $\rho_i - \rho_j = (i - j)/M$,

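\begin{align*}
\frac{|\gamma_j - \gamma_i|}{\gamma_i}
 &\ge \frac{|\gamma_j - \gamma_i|}{2\gamma} \\
 &= \frac{1}{2} \left(\frac{n}{v \ln M}\right)^{\rho_i/(1 + 2|\rho_i|)}
    \left(1 - \left(\frac{n}{v \ln M}\right)^{\rho_j/(1 + 2|\rho_j|) - \rho_i/(1 + 2|\rho_i|)}\right) \\
 &= \frac{1}{2} \left(\frac{n}{v \ln M}\right)^{\rho_i/(1 + 2|\rho_i|)}
    \left(1 - \exp\left(-\frac{i - j}{M (1 + 2|\rho_i|)(1 + 2|\rho_j|)}
      \ln\left(\frac{n}{v \ln M}\right)\right)\right) \\
 &\ge \frac{1}{2} \left(\frac{n}{v \ln M}\right)^{\rho_i/(1 + 2|\rho_i|)}
    \left(1 - \exp\left(-\frac{i - j}{2 (1 + 2|\rho_i|)(1 + 2|\rho_j|)}\right)\right) \\
 &\ge \frac{C_\rho}{2} \left(\frac{n}{v \ln M}\right)^{\rho_i/(1 + 2|\rho_i|)} ,
\end{align*}

where $C_\rho$ may be chosen as $1 - \exp\left(-\frac{1}{2(1 + 2|\rho|)^2}\right)$.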